# Foundations of Machine Learning - Session 02

- *Course*: Foundations of Machine Learning
- *Session*: 02
- *Unit*: Evaluation Metrics

This notebook introduces different metrics for evaluating a classification task. The binary case is considered first and then extended to multi-class classification.

In [None]:
import numpy as np
import pandas as pd
# Reference implementations
from sklearn.metrics import accuracy_score as sk_accuracy
from sklearn.metrics import precision_score as sk_precision
from sklearn.metrics import recall_score as sk_recall
from sklearn.metrics import f1_score as sk_f1
from sklearn.metrics import mean_squared_error as sk_mean_squared_error

## Binary Classification

Consider the following classification setting: a machine learning classifier is built that decides whether a mail is spam or not. Given below are the ground-truth labels for a test set of 10 mails and the predictions made by the classifier. The task is now to decide how *correct* the classifier is.

In [None]:
y_true = np.array([0,1,1,0,1,0,1,0,0,1])
y_pred = np.array([0,1,0,0,1,0,0,0,1,0])

### Confusion Matrix for Binary Settings

The first step in evaluating a classification task is to build a *confusion matrix*. A confusion matrix counts the four possible kinds of match or mismatch we can observe between the true and predicted label of a single sample:
- True Positive (TP): both actual and predicted label are positive (the mail is correctly flagged as spam)
- True Negative (TN): both actual and predicted label are negative (the mail is correctly flagged as not spam)
- False Positive (FP): predicted label is positive, but actual label is negative (the mail is flagged as spam, but is not spam)
- False Negative (FN): predicted label is negative, but actual label is positive (the mail is spam, but is not flagged)
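
As a minimal sketch of the four outcomes listed above, the counts can be obtained with boolean masks. The arrays `toy_true` and `toy_pred` are hypothetical toy data (not the exercise data), and the label 1 is assumed to denote the positive class (spam).

```python
import numpy as np

# Hypothetical toy labels (1 = spam, 0 = not spam); not the exercise data.
toy_true = np.array([1, 0, 1, 1, 0, 0])
toy_pred = np.array([1, 0, 0, 1, 1, 0])

# Each outcome is one combination of actual and predicted label.
tp = int(np.sum((toy_true == 1) & (toy_pred == 1)))  # spam, flagged
tn = int(np.sum((toy_true == 0) & (toy_pred == 0)))  # not spam, not flagged
fp = int(np.sum((toy_true == 0) & (toy_pred == 1)))  # not spam, but flagged
fn = int(np.sum((toy_true == 1) & (toy_pred == 0)))  # spam, but not flagged

print(tp, tn, fp, fn)
```

Every sample falls into exactly one of the four categories, so the counts always sum to the number of samples.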

<img src="https://temir.org/teaching/machine-learning-ws22/materials/02-evaluation-binary-matrix.png" alt="Binary Confusion Matrix Example" width="40%" align="center"/>

**Exercise**: complete the function below to calculate the confusion matrix based on the input true and predicted labels.

*Remark*: the output should be a 2×2 NumPy array with values corresponding to the picture above. The function should raise an error if the input arrays have different shapes.

In [None]:
def confusion_matrix(y_true: np.array, y_pred: np.array):
    # TODO

### Accuracy

Accuracy is the ratio of correct predictions (i.e. the sum of true positives and true negatives) to the total number of samples.
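
Written in terms of the four confusion-matrix counts, this definition reads as follows. The numbers in the example are hypothetical and not taken from the exercise data.

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

For example, with $TP = 8$, $TN = 82$, $FP = 2$ and $FN = 8$, the accuracy is $(8 + 82) / 100 = 0.9$.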

**Exercise**: implement a function to calculate accuracy.
    
*Remark*: you can use the values from the confusion matrix.

In [None]:
def accuracy_score(y_true: np.array, y_pred: np.array) -> float:
    # TODO

### Precision

Precision is the ratio of true positives to predicted positives.
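
In terms of the confusion-matrix counts, the predicted positives are the true positives plus the false positives. The numbers in the example are hypothetical.

$$ \text{Precision} = \frac{TP}{TP + FP} $$

For example, with $TP = 8$ and $FP = 2$, the precision is $8 / 10 = 0.8$.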

For our problem statement, that would be the ratio of mails correctly flagged as spam to total number of predicted spam mails. 

When using the model, it is important that we do not start deleting mails that are actually important, but are predicted as spam. Thus, a high precision is desirable for our classification setting.

**Exercise**: implement a function to calculate precision.

*Remark*: you can use the values from the confusion matrix.

In [None]:
def precision_score(y_true: np.array, y_pred: np.array) -> float:
    # TODO

### Recall

Recall is the ratio of true positives to actual positives.
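
In terms of the confusion-matrix counts, the actual positives are the true positives plus the false negatives. The numbers in the example are hypothetical.

$$ \text{Recall} = \frac{TP}{TP + FN} $$

For example, with $TP = 8$ and $FN = 8$, the recall is $8 / 16 = 0.5$.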

For our problem statement, that would be the ratio of mails correctly flagged as spam to total number of spam mails in the test data.

When using the model, it is important to catch as much of the spam as possible: a model that often fails to detect spam is not useful, as our inbox would start to fill up. Thus, a high recall is equally desirable for our classification setting.

**Exercise**: implement a function to calculate recall.

*Remark*: you can use the values from the confusion matrix.

In [None]:
def recall_score(y_true: np.array, y_pred: np.array) -> float:
    # TODO

### $F_1$-Score

Precision and recall are a tradeoff: if every incoming mail is flagged as spam, we would have a recall of 1, yet a low precision, equal only to the share of spam among all mails. If only mails that congratulate you on the billionth Google search and prompt you to claim your prize are flagged, the classifier would be correct in all cases (a precision of 1), yet only a small subset of the spam would actually be filtered (a low recall).

In both cases, we have a perfect model according to one of the two metrics, yet the model would have little utility. Therefore, a balance between precision and recall must be found. One metric that integrates both is the $F_1$-score. Given precision ($P$) and recall ($R$), the $F_1$-score is calculated as the harmonic mean of the two:

$$ F_1 = 2 \cdot \frac{P \cdot R}{P + R} $$
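
To illustrate why the harmonic mean is used, the following sketch evaluates the formula for hypothetical precision and recall values. The helper name `harmonic_f1` is an assumption for illustration only, and returning 0 when both inputs are 0 is one possible convention rather than a fixed rule.

```python
def harmonic_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed 0.0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical "flag everything" classifier: perfect recall, poor precision.
unbalanced = harmonic_f1(0.3, 1.0)
# Hypothetical balanced classifier.
balanced = harmonic_f1(0.8, 0.8)

print(unbalanced, balanced)
```

The unbalanced case yields roughly 0.46, well below the arithmetic mean of 0.65, because the harmonic mean is pulled towards the smaller of the two values.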

**Exercise**: implement a function to calculate the $F_1$ score.

*Remark*: you can reuse your implementation for precision and recall.

In [None]:
def f1_score(y_true, y_pred):
    # TODO

### Comparison to reference implementation

**Exercise**: compare the results of your custom functions to the reference implementation of `sklearn`. 

*Remark*: play around with different predicted labels to see the effect on evaluation scores.

In [None]:
# TODO

## Multi-class classification

As the name suggests, in multi-class classification settings, no binary distinction (negative or positive) is made; instead, one out of multiple classes is assigned to each sample. For example, imagine the email classifier now distinguishes four classes: *Normal* (0), *Important* (1), *Unimportant* (2), or *Spam* (3).
The arrays below indicate the predicted and ground-truth label for 20 example mails.

In [None]:
y_pred = np.array([1, 0, 1, 0, 0, 2, 2, 0, 2, 2, 1, 0, 2, 3, 1, 1, 0, 0, 2, 2])
y_true = np.array([2, 0, 0, 3, 1, 0, 2, 2, 0, 1, 0, 1, 2, 3, 1, 0, 1, 1, 2, 0])

### Confusion matrices for multi-class settings

The concept of confusion matrices naturally extends to the multi-class setting. Here, the co-occurrences of each class combination are counted, i.e. the confusion matrix has the shape $n\times n$, where $n$ is the number of classes.

<img src="https://temir.org/teaching/machine-learning-ws22/materials/02-evaluation-multi-class-matrix.png" alt="Multi-Class Confusion Matrix Example" width="40%" align="center"/>

The four categories of true and false positives/negatives thus become more complicated and can only be calculated *for each class* separately. The picture above marks the four categories in color for a single class. The values can be calculated by summing the cells of each color.

For example, consider all four cases from the viewpoint of class 1 (*Important*):
- True Positives: the email is important and also classified as such
- False Positives: the email is classified as important, but is not (it belongs to any other class)
- False Negatives: the email is classified as any other class, but is important
- True Negatives: the email is not classified as important and is indeed not important (it belongs to any other class)

Since every case except true positives can be of any other class (either in the classification or the ground truth), we encounter a sum over multiple cells in all three cases.
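
The sums described above can be sketched with NumPy as follows. The matrix `cm` is a hypothetical 3-class example (not the exercise data), and the layout with rows as true classes and columns as predicted classes is an assumption; the picture above may use the transposed convention.

```python
import numpy as np

# Hypothetical 3-class confusion matrix, assuming rows = true class and
# columns = predicted class.
cm = np.array([
    [5, 1, 0],
    [2, 3, 1],
    [0, 2, 6],
])

tp = np.diag(cm)                 # diagonal: predicted class equals true class
fp = cm.sum(axis=0) - tp         # rest of each column: wrongly predicted as the class
fn = cm.sum(axis=1) - tp         # rest of each row: class missed by the prediction
tn = cm.sum() - (tp + fp + fn)   # all remaining cells

print(tp, fp, fn, tn)
```

Each result is an array with one entry per class, so all per-class counts are obtained without an explicit loop.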

**Exercise**: write a function to calculate the confusion matrix for the multi-class setting.

*Remark*: the matrix should have the shape $n\times n$ where $n$ is the number of classes. Both input arrays should be of the same shape.

In [None]:
def confusion_matrix(y_true: np.array, y_pred: np.array):
    # TODO


In [None]:
confusion_matrix(y_true, y_pred)

### Micro- and Macro-Averaging

For multi-class settings, since true/false positives/negatives are only defined *per class*, evaluation scores are also intuitively per-class only. The question thus becomes how scores can be aggregated across classes. Two options exist: micro- and macro-averaging.

In micro-averaging, we collect the decisions for all classes into a single confusion matrix by adding up the true/false positive/negative counts over all classes, and then compute precision and recall from that matrix. In macro-averaging, we compute a metric for each class separately, and then average the computed values over all classes.

Micro-averaging takes class imbalance into account in the sense that the resulting performance reflects the proportion of every class, i.e. the performance on a large class has more impact on the result than the performance on a small class. Macro-averaging does not take imbalance into account: the resulting performance is a simple average over the classes, so every class is given equal weight regardless of its proportion.
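
The difference between the two averaging methods can be illustrated with a small sketch for precision. The per-class counts `tp` and `fp` are hypothetical and describe one large, well-classified class and two small, poorly classified classes.

```python
import numpy as np

# Hypothetical per-class counts: one large class, two small classes.
tp = np.array([90, 2, 1])
fp = np.array([10, 8, 9])

# Macro: compute precision per class, then take the unweighted mean.
per_class = tp / (tp + fp)
macro_precision = per_class.mean()

# Micro: pool the counts over all classes, then compute precision once.
micro_precision = tp.sum() / (tp.sum() + fp.sum())

print(per_class, macro_precision, micro_precision)
```

Under these assumed counts, the macro average (0.4) is dominated by the two weak small classes, while the micro average (0.775) is dominated by the large class.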

### Precision

**Exercise**: Write a function that extends our previous precision calculation to the multi-class setting. The parameter `average` dictates which averaging method should be used (`"micro"` or `"macro"`). If `None` is specified, the function should return a per-class metric array.

*Remark*: You can calculate TP and FP for all classes simultaneously using NumPy's vectorized functions.

In [None]:
def precision_score(y_true, y_pred, average=None):
    # TODO

### Recall

**Exercise**: Write a function that extends our previous recall calculation to the multi-class setting. The parameter `average` dictates which averaging method should be used (`"micro"` or `"macro"`). If `None` is specified, the function should return a per-class metric array.

*Remark*: You can calculate TP and FN for all classes simultaneously using NumPy's vectorized functions.

In [None]:
def recall_score(y_true, y_pred, average=None):
    # TODO

### $F_1$-Score

**Exercise**: Write a function that extends our previous $F_1$ calculation to the multi-class setting. The parameter `average` dictates which averaging method should be used (`"micro"` or `"macro"`). If `None` is specified, the function should return a per-class metric array.

In [None]:
def f1_score(y_true, y_pred, average=None):
    # TODO

### Comparison to Reference Implementation

**Exercise**: compare the results of your custom functions to the reference implementation of `sklearn`. You can play around with different predicted labels to see the effect on evaluation scores. Also try out both averaging methods and see if you can spot differences. Try to explain why.

In [None]:
# TODO