{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Foundations of Machine Learning - Session 04\n",
"\n",
"- *Course*: Foundations of Machine Learning\n",
"- *Session*: 04\n",
"- *Unit*: Bayes Classifier\n",
"\n",
"This notebook develops a Naive Bayes classifier in Numpy."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data\n",
"\n",
"For this notebooks' classifiction task, we will utilize text data for the first time. The code block below is pre-filled to load the Twenty Newsgroups dataset, which comprises newsgroups posts on 20 topics. One is displayed below for reference.\n",
"\n",
"To make the text data machine-readable, we transform each post into a vector of token counts. The dataset is thus transformed into a matrix of size `(n_samples, n_terms)` where `n_samples` is the number of posts and `n_terms` is the size of the vocabulary. Each row of the matrix is a feature vector, and each cell $(i,j)$ denotes how often term $j$ occurs in document $i$."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"---------- rec.autos ----------\n",
"I was wondering if anyone out there could enlighten me on this car I saw\n",
"the other day. It was a 2-door sports car, looked to be from the late 60s/\n",
"early 70s. It was called a Bricklin. The doors were really small. In addition,\n",
"the front bumper was separate from the rest of the body. This is \n",
"all I know. If anyone can tellme a model name, engine specs, years\n",
"of production, where this car is made, history, or whatever info you\n",
"have on this funky looking car, please e-mail.\n"
]
}
],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"data = fetch_20newsgroups(remove=[\"headers\", \"footers\", \"quotes\"])\n",
"\n",
"# CountVectorizer represents each text as vector of token counts\n",
"vectorizer = CountVectorizer(max_features=2**15)\n",
"\n",
"# Assemble D\n",
"D = list(zip(\n",
" vectorizer.fit_transform(data.data).toarray(),\n",
" data.target\n",
"))\n",
"\n",
"print(\"-\"*10, data.target_names[data.target[0]], \"-\"*10)\n",
"print(data.data[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another new concept compared to the last week is train/test splits. To make sure that the classifier generalizes well, we want to test its performance on data it has never encountered before. Therefore, we reserve some data for testing, and only use a subset for training.\n",
"\n",
"**Exercise**: split $D$ into two random subsets, a training set $D_\\text{train}$ which contains 80% of the data and a test set $D_\\text{test}$ which contains 20% of the data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Shuffle D\n",
"np.random.shuffle(D)\n",
"\n",
"# 80:20 split into train and test set\n",
"D_train = D[:int(len(D) * 0.8)]\n",
"D_test = D[int(len(D) * 0.8):]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Naive Bayes Classification\n",
"\n",
"Recall the definition of the naive bayes classifier from lecture slide [ML:VII-84]():\n",
"$$\n",
"\\begin{aligned}\n",
"P(A_i \\mid B_1,...,B_p) &= \\frac{P(A_i) \\cdot P(B_1, ..., B_p \\mid A_i)}{P(B_1, ..., B_p)} \\\\\n",
"\t\t\t\t\t\t&= \\frac{P(A_i)\\cdot\\prod^{p}_{j=1} P(B_j \\mid A_i)}{\\sum^{k}_{i=1}P(A_i)\\cdot\\prod^{p}_{j=1} P(B_j\\mid A_i)}\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"Here, $P(A_i)$ is the prior probability of class $A_i$, and $P(B_j\\mid A_i)$ is the posterior probability of feature $B_j$ given $A_i$. To build a bayes classifier, we first have to infer both of these probabilities from the training data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Estimating Prior Probabilities\n",
"\n",
"**Exercise**: compute the prior probability for every class in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "array([0.04342062, 0.05259087, 0.05192796, 0.05082311, 0.05038117,\n 0.05270136, 0.04916584, 0.05281184, 0.05413766, 0.05115457,\n 0.0531433 , 0.05325378, 0.05292233, 0.04993923, 0.05181748,\n 0.05413766, 0.04938681, 0.05203845, 0.04176334, 0.0324826 ])"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def priors(D):\n",
" \"\"\"\n",
" Computes the prior probabilities P(A_i) from D.\n",
"\n",
" :param D: training dataset to estimate prior probabilities from\n",
" :return: one-dimensional array of size (n_classes) where each cell denotes the prior probability P(A_i) of class A_i\n",
" \"\"\"\n",
" class_names, counts = np.unique(np.array([x[1] for x in D]), return_counts=True)\n",
" return np.array(counts / counts.sum())\n",
"\n",
"priors(D_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Estimating Posterior Probabilities\n",
"\n",
"**Exercise**: compute the posterior probabilities for every feature/class combination in the dataset."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "array([[0.00000000e+00, 1.23728348e-04, 0.00000000e+00, ...,\n 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],\n [1.79188532e-04, 1.27991809e-04, 0.00000000e+00, ...,\n 0.00000000e+00, 2.55983617e-05, 0.00000000e+00],\n [1.18564394e-04, 2.09231284e-05, 0.00000000e+00, ...,\n 7.67181376e-05, 0.00000000e+00, 1.25538771e-04],\n ...,\n [4.59335669e-05, 7.54622885e-04, 0.00000000e+00, ...,\n 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],\n [7.75531966e-05, 3.78071834e-04, 0.00000000e+00, ...,\n 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],\n [1.74975066e-05, 1.04985040e-04, 0.00000000e+00, ...,\n 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def posteriors(D):\n",
" \"\"\"\n",
" Computes the posterior probabilities P(B_j | A_i) for all classes A_i and all features B_j\n",
"\n",
" :param D: Dataset to estimate probabilities from\n",
" :return: two-dimensional array of size (n_classes, n_features) where cell (i, j) denotes P(B_j | A_i)\n",
" \"\"\"\n",
" # Infer Classes from D\n",
" classes = np.array([x[1] for x in D])\n",
" # Assemble feature matrix from D for more efficient lookup\n",
" feature_matrix = np.array([x[0] for x in D])\n",
"\n",
" # Number of classes\n",
" n_classes = np.unique(classes).shape[0]\n",
" # Number of features\n",
" n_features = feature_matrix.shape[1]\n",
"\n",
" # Empty array to hold posterior probabilities\n",
" posterior_prob = np.zeros(shape=(n_classes, n_features))\n",
"\n",
" for i in range(n_classes):\n",
" posterior_prob[i] = feature_matrix[classes == i].sum(axis=0)\n",
"\n",
" return posterior_prob / posterior_prob.sum(axis=1).reshape(-1, 1)\n",
"\n",
"posteriors(D_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Assembling the classifier\n",
"\n",
"Given both prior and posterior probabilities, we can implement the bayes classifier using the formula given previously. However, one problem might occur: if a term never occurs in a class ($P(B_j | A_i) = 0$, no matter what other features occur in a sample, the class prediction will always be 0. Therefore, we add a small value $\\alpha$ to every probability to avoid multiplying by 0.\n",
"\n",
"**Exercise**: implement a Bayes classifier following the formula from the lecture.\n",
"\n",
"*Hint*: add $\\alpha$ to the prior and posterior probabilities to increase robustness."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,\n 0.00000000e+000, 0.00000000e+000, 0.00000000e+000],\n [2.99626552e-017, 1.11797579e-018, 3.93977652e-020, ...,\n 2.36481467e-017, 2.19312394e-017, 2.71815906e-017],\n [0.00000000e+000, 8.73751584e-315, 8.64614880e-322, ...,\n 0.00000000e+000, 0.00000000e+000, 0.00000000e+000],\n ...,\n [1.95055917e-147, 1.55166015e-149, 1.51838170e-156, ...,\n 4.41945632e-147, 1.26475326e-141, 7.76361046e-146],\n [1.93326286e-217, 3.76059881e-210, 1.38830679e-230, ...,\n 6.33012564e-218, 9.11744062e-217, 5.84279718e-220],\n [4.87345516e-293, 4.14499536e-270, 9.61455328e-287, ...,\n 1.47789228e-293, 1.77639158e-294, 4.58845314e-301]])"
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def predict(D, priors, posteriors, alpha=1e-6, return_probs=False):\n",
" \"\"\"\n",
" Predicts classes for test dataset D.\n",
"\n",
" :param D: test dataset to predict classes for\n",
" :param priors: prior probabilities from the train dataset\n",
" :param posteriors: posterior probabilities from the test dataset\n",
" :param alpha: alpha-value to add to the probability inputs\n",
" :param return_probs: whether to return raw probabilities (True) or just the most likely class label (False)\n",
" :return: if returning raw probabilities, returns array of size (n_samples, n_classes) where each cell denotes the class probability for that sample; if returning class labels, returns array of size (n_samples), where each cell is the most likely class of that sample\n",
" \"\"\"\n",
" n_classes = posteriors.shape[0]\n",
" n_samples = len(D)\n",
" predictions = np.zeros(shape=(n_samples, n_classes))\n",
" # Adding alpha to deal with terms that never occur for a class\n",
" # Otherwise, documents containing that term could never be assigned that class (product reduces to 0)!\n",
" priors = priors + alpha\n",
" posteriors = posteriors + alpha\n",
"\n",
" for i, (x, _) in enumerate(D):\n",
" term_mask = x.astype(bool)\n",
" predictions[i] = np.prod(posteriors[:, term_mask], axis=1) * priors\n",
" prob_sum = predictions[i].sum()\n",
" predictions[i] /= prob_sum\n",
"\n",
" if return_probs:\n",
" return predictions\n",
" else:\n",
" return np.argmax(predictions, axis=1)\n",
"\n",
"\n",
"predict(D_test, priors(D_train), posteriors(D_train), return_probs=True)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Investigate the output of the classifier for $D_\\text{test}$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Rebuilding the classifier in log space\n",
"\n",
"We can observe that some probabilities are really, really small! This might become a problem due to our naive implementation: since we multiply many probabilities, the value eventually underflows, i.e. is smaller than the smallest float64-representable number. Thus, there is no discernible difference to 0 anymore.\n",
"\n",
"The solution is implement our classifier using [log probabilities](https://en.wikipedia.org/wiki/Log_probability) instead. Simply transform the probabilities to log space and replace mathematical operations in your implementation using the log-space equivalents. For convenience, `log_add` is implemented below. If you are new to log probabilities, give the linked resources a read."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def log_add(a, b):\n",
" # Addition in log space, see https://en.wikipedia.org/wiki/Log_probability#Addition_in_log_space for derivation of formula\n",
" return a + np.log1p(np.exp(b-a))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**: adapt your `predict` implementation to log-space.\n",
"\n",
"*Hint*:\n",
"- `reduce` might come in handy when constructing sums of log-probability arrays.\n",
"- do not forget to cast back the final predictions to real space using `np.exp`"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/var/folders/54/8lplw84d1sbgcm1wmd2rgyd40000gn/T/ipykernel_26832/1611980057.py:3: RuntimeWarning: overflow encountered in exp\n",
" return a + np.log1p(np.exp(b-a))\n"
]
}
],
"source": [
"from functools import reduce\n",
"\n",
"def predict_log(D, priors, posteriors, alpha=1e-6, return_probs=False):\n",
" \"\"\"\n",
" Predicts classes for test dataset D in log space.\n",
"\n",
" :param D: test dataset to predict classes for\n",
" :param priors: prior probabilities from the train dataset\n",
" :param posteriors: posterior probabilities from the test dataset\n",
" :param alpha: alpha-value to add to the probability inputs\n",
" :param return_probs: whether to return raw probabilities (True) or just the most likely class label (False)\n",
" :return: if returning raw probabilities, returns array of size (n_samples, n_classes) where each cell denotes the class probability for that sample; if returning class labels, returns array of size (n_samples), where each cell is the most likely class of that sample\n",
" \"\"\"\n",
" n_classes = posteriors.shape[0]\n",
" n_samples = len(D)\n",
" predictions = np.zeros(shape=(n_samples, n_classes))\n",
"\n",
" # Cast priors and posteriors to log space to avoid underflow\n",
" # Adding alpha to deal with terms that never occur for a class\n",
" priors = np.log(priors + alpha)\n",
" posteriors = np.log(posteriors + alpha)\n",
"\n",
" for i, (x, _) in enumerate(D):\n",
" # Construct term mask to only deal with observed terms\n",
" term_mask = x.astype(bool)\n",
" # Sum instead of multiplying since we are in log space!\n",
" predictions[i] = np.sum(posteriors[:, term_mask], axis=1) + priors\n",
" # log_add instead of + since we are in log space!\n",
" log_prob_sum = reduce(log_add, predictions[i, :])\n",
" # Minus instead of / since we are in log space!\n",
" predictions[i] -= log_prob_sum\n",
" # Cast back to real space and return\n",
" if return_probs:\n",
" return np.exp(predictions)\n",
" else:\n",
" return np.argmax(np.exp(predictions), axis=1)\n",
"\n",
"\n",
"p = predict_log(D_test, priors(D_train), posteriors(D_train))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation\n",
"\n",
"**Exercise**: evaluate your classifier by comparing the predicted labels to the ground-truth labels of $D_\\text{test}$.\n",
"\n",
"*Hint*: you can either use your own evaluation implementation from a previous lab, or use the `sklearn` metric implementations."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" precision recall f1-score support\n",
"\n",
" 0 0.63 0.51 0.57 107\n",
" 1 0.73 0.56 0.63 142\n",
" 2 0.00 0.00 0.00 1\n",
" 3 0.75 0.60 0.67 164\n",
" 4 0.75 0.66 0.70 140\n",
" 5 0.81 0.73 0.77 129\n",
" 6 0.73 0.88 0.80 116\n",
" 7 0.70 0.65 0.68 124\n",
" 8 0.73 0.44 0.55 179\n",
" 9 0.79 0.90 0.84 118\n",
" 10 0.78 0.92 0.85 101\n",
" 11 0.82 0.82 0.82 114\n",
" 12 0.67 0.77 0.71 98\n",
" 13 0.83 0.88 0.86 134\n",
" 14 0.74 0.84 0.79 109\n",
" 15 0.85 0.65 0.74 143\n",
" 16 0.76 0.70 0.73 107\n",
" 17 0.72 0.74 0.73 91\n",
" 18 0.74 0.63 0.68 101\n",
" 19 0.34 0.62 0.44 45\n",
"\n",
" accuracy 0.70 2263\n",
" macro avg 0.69 0.67 0.68 2263\n",
"weighted avg 0.75 0.70 0.72 2263\n",
"\n"
]
}
],
"source": [
"from sklearn.metrics import classification_report\n",
"\n",
"print(classification_report(p, [d[1] for d in D_test]))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"outputs": [],
"source": [],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}