Performance Metrics: F1-Score

What is the F1-Score?

The F1-Score has many names:

F-Score
F-Measure
Sørensen’s Similarity Coefficient
Sørensen–Dice Coefficient
Dice Similarity Coefficient (DSC)
Dice’s Coincidence Index
Hellden’s Mean Accuracy Index

The F1-Score is a metric for evaluating the performance of a binary classifier. It is calculated as the harmonic mean of the precision (PPV) and the recall (TPR). The F1-Score takes a value between 0 and 1. A score of 0 is the worst case and occurs when either precision or recall is zero. A score of 1 is the best case and requires both perfect precision and perfect recall.

$F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{2}{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}$

It is best interpreted as a measure of overlap between the true and the predicted positive class: it focuses on true positives (TP) and ignores true negatives (TN).
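As a quick check that the count-based form and the precision/recall form of the formula agree, here is a minimal Python sketch. The function names and the example counts are made up for illustration, and returning 0.0 when the denominator is zero is an assumed convention, not something the formula itself prescribes.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 from confusion-matrix counts; true negatives are not used."""
    denom = tp + 0.5 * (fp + fn)
    # Assumed convention: return 0.0 when there are no positives at all.
    return tp / denom if denom > 0 else 0.0


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # same assumed convention as above
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only for illustration.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)  # 8 / 10
recall = tp / (tp + fn)     # 8 / 12

f1_counts = f1_from_counts(tp, fp, fn)               # 8 / 11
f1_pr = f1_from_precision_recall(precision, recall)  # same value

print(round(f1_counts, 4), round(f1_pr, 4))
```

Both calls give 8/11, roughly 0.727, which sits below the arithmetic mean of precision and recall because the harmonic mean is pulled toward the smaller of the two values.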

The Fβ-Score

There is also a generalized version, the Fβ-Score, in which the parameter β can be set to give more weight to either precision or recall.

$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}$

The F0.5-Score is an example of the Fβ-Score with a beta of 0.5. It raises the importance of precision relative to recall.
The F2-Score is an example of the Fβ-Score with a beta of 2. It lowers the importance of precision relative to recall.
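The effect of β can be seen in a small Python sketch of the Fβ formula. The function name `fbeta` and the precision/recall values are hypothetical, and returning 0.0 for a zero denominator is an assumed convention.

```python
def fbeta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: beta < 1 favours precision, beta > 1 favours recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # assumed convention for the undefined case
    return (1 + b2) * precision * recall / denom


# Hypothetical classifier with high precision but low recall.
precision, recall = 0.9, 0.5

f_half = fbeta(precision, recall, beta=0.5)  # leans toward precision
f_one = fbeta(precision, recall, beta=1.0)   # plain F1
f_two = fbeta(precision, recall, beta=2.0)   # leans toward recall

print(round(f_half, 3), round(f_one, 3), round(f_two, 3))
```

Because precision is the stronger of the two values in this example, the score falls as β grows: F0.5 is about 0.776, F1 about 0.643 and F2 about 0.549.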

Criticism of the F1-Score

The exclusive focus on precision and recall is not without criticism. The F1-Score considers the positive class only and ignores true negatives (TN). That’s why it might not be the best performance metric in situations of class imbalance.
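The point about true negatives can be made concrete with a small Python sketch: the counts below are invented for illustration, and accuracy is used only as a contrasting metric that does take TN into account.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; note that TN is not an argument."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom > 0 else 0.0


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct; depends on TN."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: same TP, FP, FN, but very different TN.
tp, fp, fn = 8, 2, 4
results = {}
for tn in (10, 1000):
    results[tn] = (f1(tp, fp, fn), accuracy(tp, fp, fn, tn))
    print(tn, round(results[tn][0], 3), round(results[tn][1], 3))
```

The F1-Score stays at about 0.727 in both cases, while accuracy moves from 0.75 to about 0.994. Whether the classifier handles ten or a thousand negatives correctly is invisible to F1.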

Further reading: What the F-measure doesn’t measure: Features, Flaws, Fallacies and Fixes
