{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Foundations of Machine Learning - Session 02\n", "\n", "- *Course*: Foundations of Machine Learning\n", "- *Session*: 02\n", "- *Unit*: Evaluation Metrics" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "This notebook introduces different metrics for evaluating a classification task. First, the binary case will be considered, and then extended to multiclass-classification." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": true }, "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "# Reference implementations\n", "from sklearn.metrics import accuracy_score as sk_accuracy\n", "from sklearn.metrics import precision_score as sk_precision\n", "from sklearn.metrics import recall_score as sk_recall\n", "from sklearn.metrics import f1_score as sk_f1\n", "from sklearn.metrics import mean_squared_error as sk_mean_squared_error" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Binary Classification\n", "\n", "Consider the following classification setting: a machine learning classifier is built that decides whether a picture depicts a mail is spam or not. Given below are the ground-truth labels for a test set of 10 mails and the prediction made by the classifier. The task is to now decide if the classifier is *correct*." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "y_true = np.array([0,1,1,0,1,0,1,0,0,1])\n", "y_pred = np.array([0,1,0,0,1,0,0,0,1,0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Confusion Matrix for Binary Settings\n", "\n", "The first step for evaluation of a classification task is to build a *confusion matrix*. A confusion matrix counts the for possible kinds of match or divergence we can observe in the true and predicted label for a single sample:\n", "- True Positive (TP): both actual and predicted label are positive (the mail is correctly flagged as spam)\n", "- True Negative (TN): both actual and predicted label are negative (the mail correctly flagged as not spam)\n", "- False Positive (FP): predicted label is positive, but actual label is negative (spam is flagged but the mail was not)\n", "- False Negative (FN): predicted label is negative, but actual label is positive (the mail was spam, but is not flagged)\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Exercise**: complete the function below to calculate the confusion matrix based on the input true and predicted labels.\n", "\n", "*Remark*: the output should be a 2x2 numpy array with values corresponding to the picture above. The function should throw an error if the input arrays are of different shapes." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def confusion_matrix(y_true: np.array, y_pred: np.array):\n", " assert y_true.shape == y_pred.shape\n", " cm = np.zeros(shape=(2,2), dtype=np.intc)\n", "\n", " # True positives (both labels are 1)\n", " cm[0,0] = np.sum(np.logical_and(y_true == 1, y_pred == 1))\n", " # False positives (actual label is 0, predicted label is 1)\n", " cm[0,1] = np.sum(np.logical_and(y_true == 0, y_pred == 1))\n", " # False negatives (actual label is 1, predicted label is 0)\n", " cm[1,0] = np.sum(np.logical_and(y_true == 1, y_pred == 0))\n", " # True negatives (both labels are 0)\n", " cm[1,1] = np.sum(np.logical_and(y_true == 0, y_pred == 0))\n", "\n", " return cm\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Actual
PositiveNegative
PredictedPositive21
Negative34
\n", "
" ], "text/plain": [ " Actual \n", " Positive Negative\n", "Predicted Positive 2 1\n", " Negative 3 4" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(\n", " confusion_matrix(y_true, y_pred),\n", " columns = pd.MultiIndex.from_tuples([(\"Actual\", \"Positive\"), (\"Actual\", \"Negative\")]),\n", " index = pd.MultiIndex.from_tuples([(\"Predicted\", \"Positive\"), (\"Predicted\", \"Negative\")])\n", ")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Accuracy\n", "\n", "Accuracy is the ratio of true predictions (i.e. the sum of true positives and true negatives) to total samples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**: implement a function to calculate accuracy.\n", " \n", "*Remark*: you can use the values from the contingency matrix." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def accuracy_score(y_true: np.array, y_pred: np.array) -> float:\n", " cm = confusion_matrix(y_true, y_pred)\n", " # (true positives + true negatives) / total samples\n", " return (cm[0,0] + cm[1,1]) / np.sum(cm)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Precision\n", "\n", "Precision is the ratio of true positives to predicted positives.\n", "\n", "For our problem statement, that would be the ratio of mails correctly flagged as spam to total number of predicted spam mails. \n", "\n", "When using the model, it is important that we do not start deleting mails that are actually important, but are predicted as spam. Thus, a high precision is desirable for our classification setting.\n", "\n", "**Exercise**: implement a function to calculate precision.\n", "\n", "*Remark*: you can use the values from contigency matrix." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "def precision_score(y_true: np.array, y_pred: np.array) -> float:\n", " cm = confusion_matrix(y_true, y_pred)\n", " # true positives / predicted positives\n", " return cm[0,0] / np.sum(cm[0,:])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recall\n", "\n", "Recall is the ratio of true positives to actual positives.\n", "\n", "For our problem statement, that would be the ratio of mails correctly flagged as spam to total number of spam mails in the test data.\n", "\n", "When using the model, it is important to correctly classify a large number of mails: a model that fails to detect spam often is not useful, as our inbox would start to fill up. Thus, a high recall is equally desirable for our classification setting." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Exercise**: implement a function to calculate recall.\n", "\n", "*Remark*: you can use the values from contigency matrix." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "def recall_score(y_true: np.array, y_pred: np.array) -> float:\n", " cm = confusion_matrix(y_true, y_pred)\n", " # true positives / actual positives\n", " return cm[0,0] / np.sum(cm[:, 0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### $F_1$-Score\n", "\n", "Precision and recall are a tradeoff: if every incoming mail is flagged as spam, we would have a recall of 1, yet a precision of 0. If only mails that congratulate you on the billionth google search and prompt you to claim your prize, it would be correct in call cases, yet only a subset of the spam is actually filtered.\n", "\n", "In both cases, we have a perfect model according to one of the two metrics, yet the model would have little utility. Therefore, a balance between precision and recall must be found. One metric that integrates both is the $F_1$-score. Given precision ($P$) and recall ($R$), the $F_1$ score is calculated as harmonic mean between the two:\n", "\n", "$$F_1 = 2\\frac{P*R}{P+R}$$" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**Exercise**: implement a function to calculate the $F_1$ score.\n", "\n", "*Remark*: you can reuse your implementation for precision and recall." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "def f1_score(y_true, y_pred):\n", " prec = precision_score(y_true, y_pred)\n", " rec = recall_score(y_true, y_pred)\n", " return 2 * (prec * rec) / (prec + rec)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Comparison to reference implementation\n", "\n", "**Exercise**: compare the results of your custom functions to the reference implementation of sklearn. \n", "\n", "*Remark*: play around with different predicted labels to see the effect on evaluation scores." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AccuracyPrecisionRecallF1 Score
Custom0.60.6670.40.5
Reference0.60.6670.40.5
\n", "
" ], "text/plain": [ " Accuracy Precision Recall F1 Score\n", "Custom 0.6 0.667 0.4 0.5\n", "Reference 0.6 0.667 0.4 0.5" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({\n", " \"Accuracy\": (accuracy_score(y_true, y_pred), sk_accuracy(y_true, y_pred)),\n", " \"Precision\": (precision_score(y_true, y_pred), sk_precision(y_true, y_pred)),\n", " \"Recall\": (recall_score(y_true, y_pred), sk_recall(y_true, y_pred)),\n", " \"F1 Score\": (f1_score(y_true, y_pred), sk_f1(y_true, y_pred))\n", "}, index=[\"Custom\", \"Reference\"]).round(3)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Multi-class classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As the name suggests, in multiclass classification settings, not a binary distinction (Negative or Positive) is taken, but one out of multiple classes is assigned to each sample. For example, imagine the email classifier now discerns four classes: *Normal* (0), *Important* (1), *Unimportant*, (2), or *Spam* (3). \n", "The arrays below indicate the predicted and ground-truth label for 20 example mails." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "y_pred = np.array([1, 0, 1, 0, 0, 2, 2, 0, 2, 2, 1, 0, 2, 3, 1, 1, 0, 0, 2, 2])\n", "y_true = np.array([2, 0, 0, 3, 1, 0, 2, 2, 0, 1, 0, 1, 2, 3, 1, 0, 1, 1, 2, 0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Confusion matrices for multi-class settings\n", "\n", "The concept of confusion matrices naturally extends to the multi-class setting as well. Here, the co-occurrences of each class combination is counted, i.e. the confusion matrix has the shape $n\\times n$ where $n$ is the number of classes. \n", "\n", " \n", "\n", "The four categories of true and false positives/negatives thus get more complicated and can only be calculated *for each class* separately. Illustrated in the picture above, you can find the categories marked in color for a single class. The values can be calculated by summing the cells of each color.\n", "\n", "For example, consider all four cases from the viewpoint of class 2 (*Important*):\n", "- True Positive: the email is important and also classified as such\n", "- False Positives: the email is classified as important, but is not (any other class)\n", "- False Negatives: the is classified as any other class, but is important\n", "- True Negatives: the image is not classified as important and is indeed not (any other class)\n", "\n", "Since every case except true positives can be of any other class (either in the classification or the ground truth), we encounter a sum over multiple cells in all three cases." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Exercise**: write a function to calculate the confusion matrix for the multi-class setting.\n", "\n", "*Remark*: the matrix should have the shape $n\\times n$ where $n$ is the number of classes. Both input arrays should be of the same shape." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def confusion_matrix(y_true: np.array, y_pred: np.array):\n", " assert y_true.shape == y_pred.shape\n", " classes = np.unique(y_true)\n", " num_classes = len(classes)\n", "\n", " cm = np.zeros(shape=(num_classes, num_classes), dtype=np.intc)\n", " for i, c_i in enumerate(classes): # Predicted class\n", " for j, c_j in enumerate(classes): # Actual class\n", " cm[i,j] = np.sum(np.logical_and(y_pred == c_i, y_true == c_j))\n", " return cm\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[1, 4, 1, 1],\n", " [3, 1, 1, 0],\n", " [3, 1, 3, 0],\n", " [0, 0, 0, 1]], dtype=int32)" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "confusion_matrix(y_true, y_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Micro- and Macro-Averaging\n", "\n", "For multiclass settings, since true/false positives/negatives are only defined *per class*, evaluation scores are also inutitively per-class only. The question thus becomes how scores can be aggregated across classes. Two options exist: micro- and macro-averaging. \n", "\n", "In micro-averaging, we collect the decisions for all classes into a single confusion matrix, adding the true/false positive/negative rates over all classes, and then compute precision and recall from that matrix. In macro-averaging, we compute a metric for each class separately, and then average the computed values over all classes. \n", "\n", "Micro-averaging takes class imbalance into account in the sense that the resulting performance is based on the proportion of every class, i.e. the performance of a large class has more impact on the result than of a small class. Macro-averaging doesn't take imbalance into account in the sense that the resulting performance is a simple average over the classes, so every class is given equal weight independently from their proportion." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Precision\n", "\n", "**Exercise**: Write a function that extends our previous precision calculation to the multiclass setting. The parameter average dictates which averaging method should be used (\"micro\" or \"macro\"). If None is specified, the function should return a per-class metric array.\n", "\n", "*Remark*: You can calculate TP and FP for all classes simultaneously using NumPy's vectorized functions." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def precision_score(y_true, y_pred, average=None):\n", " cm = confusion_matrix(y_true, y_pred)\n", " TP = cm.diagonal()\n", " FP = cm.sum(axis=1) - TP\n", " if average == \"micro\":\n", " return np.sum(TP) / (np.sum(TP) + np.sum(FP))\n", " elif average == \"macro\":\n", " return np.average(TP / (TP + FP))\n", " else:\n", " return TP / (TP + FP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Recall\n", "\n", "**Exercise**: Write a function that extends our previous recall calculation to the multiclass setting. The parameter average dictates which averaging method should be used (\"micro\" or \"macro\"). If None is specified, the function should return a per-class metric array.\n", "\n", "*Remark*: You can calculate TP and FN for all classes simultaneously using NumPy's vectorized functions." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def recall_score(y_true, y_pred, average=None):\n", " cm = confusion_matrix(y_true, y_pred)\n", " TP = np.diag(cm)\n", " FN = cm.sum(axis=0) - TP\n", " if average == \"micro\":\n", " return np.sum(TP) / (np.sum(TP) + np.sum(FN))\n", " elif average == \"macro\":\n", " return np.average(TP / (TP + FN))\n", " else:\n", " return TP / (TP + FN)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $F_1$-Score\n", "\n", "**Exercise**: Write a function that extends our previous $F_1$ calculation to the multiclass setting. The parameter average dictates which averaging method should be used (\"micro\" or \"macro\"). If None is specified, the function should return a per-class metric array.\n", "\n", "*Remark*: both precision and recall should be calculated with the corresponding averaging method." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def f1_score(y_true, y_pred, average=None):\n", " prec = precision_score(y_true, y_pred, average=average)\n", " rec = recall_score(y_true, y_pred, average=average)\n", " return (2 * prec * rec) / (prec + rec)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comparison to Reference Implementation\n", "\n", "**Exercise**: compare the results of your custom functions to the reference implementation of sklearn. You can play around with different predicted labels to see the effect on evaluation scores. Also try out both averaging methods and see if you can spot differences. Try to explain why." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
precisionrecallf1 score
Custom0.4430.3520.3
Reference0.4430.3520.3
\n", "
" ], "text/plain": [ " precision recall f1 score\n", "Custom 0.443 0.352 0.3\n", "Reference 0.443 0.352 0.3" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame({\n", " \"precision\": (precision_score(y_true, y_pred, average=\"macro\"), sk_precision(y_true, y_pred, average=\"macro\")),\n", " \"recall\": (recall_score(y_true, y_pred, average=\"macro\"), sk_recall(y_true, y_pred, average=\"macro\")),\n", " \"f1 score\": (f1_score(y_true, y_pred, average=\"micro\"), sk_f1(y_true, y_pred, average=\"micro\"))\n", "}, index=[\"Custom\", \"Reference\"]).round(3)" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 1 }