0. The running example
Assume we have 1,000 examples: 200 are actually positive and 800 are actually negative, so positives make up only 20% of the data.
1. Confusion matrix: the source of most metrics
Suppose our classifier predicts positive for 150 examples. Out of those, 120 are truly positive and 30 are actually negative.
| | Actual positive | Actual negative |
|---|---|---|
| Predicted positive | TP = 120 | FP = 30 |
| Predicted negative | FN = 80 | TN = 770 |
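All four cells come straight from comparing predictions with labels. A minimal sketch in Python, assuming hypothetical 0/1 lists `y_true` and `y_pred` rather than the article's actual data:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN from parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# For the running example the counts would come out as:
# tp, fp, fn, tn = 120, 30, 80, 770   (sums to 1,000 examples)
```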
2. Precision: “When I say positive, how often am I right?”
precision = TP / (TP + FP)
In our example:
precision = 120 / (120 + 30) = 120 / 150 = 0.80
High precision means few false positives. It is important when false alarms are expensive.
Examples: fraud investigation queues, medical follow-up tests, moderation actions, expensive manual review.
3. Recall: “Of all real positives, how many did I catch?”
recall = TP / (TP + FN)
In our example:
recall = 120 / (120 + 80) = 120 / 200 = 0.60
High recall means few false negatives. It is important when missing a positive is expensive.
Examples: cancer screening, safety violations, security threats, rare but important events.
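Both precision and recall are one line of arithmetic once the four counts are known. A quick check of the running example in plain Python:

```python
tp, fp, fn, tn = 120, 30, 80, 770

precision = tp / (tp + fp)   # 120 / 150
recall = tp / (tp + fn)      # 120 / 200

print(f"precision = {precision:.2f}")  # 0.80
print(f"recall    = {recall:.2f}")     # 0.60
```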
4. Precision and recall pull against each other
Most classifiers output a score, not just a hard label. You choose a threshold. Lowering the threshold predicts positive more often; raising the threshold predicts positive less often.
Lower threshold
Predict more positives.
- Usually higher recall: you catch more true positives.
- Usually lower precision: you also catch more false positives.
Higher threshold
Predict fewer positives.
- Usually higher precision: flagged examples are more trustworthy.
- Usually lower recall: you miss more true positives.
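The threshold itself is nothing more than a comparison against the score. A minimal sketch, with hypothetical scores chosen only to illustrate the effect:

```python
def predict_at_threshold(scores, threshold):
    """Turn raw model scores into hard 0/1 predictions."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.95, 0.70, 0.40, 0.15]          # hypothetical model scores
print(predict_at_threshold(scores, 0.8))   # [1, 0, 0, 0]  strict: fewer positives
print(predict_at_threshold(scores, 0.3))   # [1, 1, 1, 0]  loose: more positives
```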
5. F1-score: one number for the precision/recall balance
F1 = 2 × precision × recall / (precision + recall)
Using precision = 0.80 and recall = 0.60:
F1 = 2 × 0.80 × 0.60 / (0.80 + 0.60) = 0.686
| Precision | Recall | F1 | Interpretation |
|---|---|---|---|
| 0.90 | 0.90 | 0.90 | Both strong |
| 1.00 | 0.10 | 0.18 | Very selective, misses most positives |
| 0.25 | 1.00 | 0.40 | Catches everything, many false alarms |
| 0.80 | 0.60 | 0.69 | Our running example |
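Because F1 is a harmonic mean, it is dragged down by whichever of the two numbers is smaller, which is why the lopsided rows score so poorly. Reproducing the table values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for p, r in [(0.90, 0.90), (1.00, 0.10), (0.25, 1.00), (0.80, 0.60)]:
    print(f"precision={p:.2f}  recall={r:.2f}  F1={f1_score(p, r):.2f}")
# F1 = 0.90, 0.18, 0.40, 0.69
```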
6. Specificity and false positive rate
ROC curves use recall, but they call it true positive rate.
TPR = recall = TP / (TP + FN)
They also use false positive rate:
FPR = FP / (FP + TN)
In our example:
FPR = 30 / (30 + 770) = 30 / 800 = 0.0375
Specificity is the true negative rate:
specificity = TN / (TN + FP) = 1 - FPR
Recall / TPR asks: among actual positives, how many did we catch?
FPR asks: among actual negatives, how many did we incorrectly flag?
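All of these rates come from the same four counts. Checking the running example:

```python
tp, fp, fn, tn = 120, 30, 80, 770

tpr = tp / (tp + fn)          # recall: 120 / 200 = 0.60
fpr = fp / (fp + tn)          # 30 / 800 = 0.0375
specificity = tn / (tn + fp)  # 770 / 800 = 0.9625

assert abs(specificity - (1 - fpr)) < 1e-12   # specificity = 1 - FPR
```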
7. Threshold sweep example
Imagine the same model with different thresholds. We are not retraining the model. We are only changing how confident it must be before saying “positive.”
| Threshold style | TP | FP | FN | TN | Precision | Recall / TPR | FPR | F1 |
|---|---|---|---|---|---|---|---|---|
| Very strict | 50 | 5 | 150 | 795 | 0.91 | 0.25 | 0.006 | 0.39 |
| Strict | 100 | 20 | 100 | 780 | 0.83 | 0.50 | 0.025 | 0.63 |
| Middle | 120 | 30 | 80 | 770 | 0.80 | 0.60 | 0.038 | 0.69 |
| Loose | 170 | 160 | 30 | 640 | 0.52 | 0.85 | 0.200 | 0.64 |
| Very loose | 195 | 500 | 5 | 300 | 0.28 | 0.98 | 0.625 | 0.44 |
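Every derived column in the table is a function of the four raw counts, so the whole sweep can be regenerated with a few lines of Python:

```python
rows = {
    "Very strict": (50, 5, 150, 795),
    "Strict": (100, 20, 100, 780),
    "Middle": (120, 30, 80, 770),
    "Loose": (170, 160, 30, 640),
    "Very loose": (195, 500, 5, 300),
}

for name, (tp, fp, fn, tn) in rows.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:12s} P={precision:.2f}  R={recall:.2f}  FPR={fpr:.3f}  F1={f1:.2f}")
```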
8. ROC curve: performance across all thresholds
A ROC curve plots FPR on the x-axis and TPR (recall) on the y-axis.
Using the threshold sweep above, the ROC points are approximately:
| Threshold style | FPR | TPR / Recall |
|---|---|---|
| Very strict | 0.006 | 0.25 |
| Strict | 0.025 | 0.50 |
| Middle | 0.038 | 0.60 |
| Loose | 0.200 | 0.85 |
| Very loose | 0.625 | 0.98 |
Each point is one threshold. Moving from left to right means using a looser threshold: more true positives, but also more false positives.
Good ROC behavior
The curve rises steeply toward the top-left. That means you can catch many positives while flagging relatively few negatives.
Bad ROC behavior
The curve hugs the diagonal. That means the score is not ranking positives ahead of negatives much better than random guessing.
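In practice you rarely compute ROC points by hand; libraries sweep every distinct score for you. A minimal sketch, assuming scikit-learn is available and using small hypothetical labels and scores rather than the article's 1,000-example dataset:

```python
from sklearn.metrics import roc_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]                         # hypothetical labels
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2]  # hypothetical scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```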
9. AUC: “How good is the ranking?”
AUC is the area under the ROC curve. It ranges from 0 to 1, and it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.
Why does “area under the ROC curve” equal that probability?
The ROC curve can feel geometric, but AUC has a simpler ranking interpretation.
Imagine taking one positive and one negative at random. The classifier gives each one a score. There are three possibilities:
- The positive scores higher: the model ranked this pair correctly.
- The negative scores higher: the model ranked this pair backwards.
- The two scores are tied: usually counted as half-correct.
So another way to compute AUC is:
AUC = correctly ordered positive-negative pairs / all positive-negative pairs
With our 200 positives and 800 negatives, there are:
200 × 800 = 160,000 positive-negative pairs
If the model ranks the positive above the negative in 144,000 of those pairs, then:
AUC = 144,000 / 160,000 = 0.90
Here are 5 positives and 5 negatives sorted by model score from highest to lowest. Every positive-negative pair where the positive appears earlier in the list is a correctly ordered pair.
| Rank by score | Label | Negatives below this positive | Pairwise contribution |
|---|---|---|---|
| 1 | Positive | 5 | 5 correct pairs |
| 2 | Negative | — | — |
| 3 | Positive | 4 | 4 correct pairs |
| 4 | Positive | 4 | 4 correct pairs |
| 5 | Negative | — | — |
| 6 | Negative | — | — |
| 7 | Positive | 2 | 2 correct pairs |
| 8 | Negative | — | — |
| 9 | Positive | 1 | 1 correct pair |
| 10 | Negative | — | — |
There are 5 × 5 = 25 possible positive-negative pairs. This ranking gets 5 + 4 + 4 + 2 + 1 = 16 of them correct, so AUC = 16 / 25 = 0.64.
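The pairwise count is easy to automate. A small sketch that reproduces the 16 / 25 result, using arbitrary decreasing numbers as stand-ins for the model scores:

```python
from itertools import product

# Same ordering as the table: 1 = positive, 0 = negative, sorted by score.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]   # any strictly decreasing scores work

positives = [s for s, l in zip(scores, labels) if l == 1]
negatives = [s for s, l in zip(scores, labels) if l == 0]

correct = sum(1 for p, n in product(positives, negatives) if p > n)
total = len(positives) * len(negatives)
print(f"AUC = {correct} / {total} = {correct / total}")   # AUC = 16 / 25 = 0.64
```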
The more the ROC curve bows toward the top-left, the larger the area under it. The diagonal baseline has AUC = 0.5.
Why is the random baseline diagonal?
A random classifier gives scores that are unrelated to the true label. So if you take the top 10% of examples by score, you expect to get about 10% of the positives and about 10% of the negatives. If you take the top 40%, you expect to get about 40% of the positives and about 40% of the negatives.
That means:
TPR ≈ FPR
And the graph of y = x is a diagonal line.
| Fraction selected by random score | Expected TPR | Expected FPR | ROC point |
|---|---|---|---|
| 10% | 0.10 | 0.10 | (0.10, 0.10) |
| 25% | 0.25 | 0.25 | (0.25, 0.25) |
| 50% | 0.50 | 0.50 | (0.50, 0.50) |
| 75% | 0.75 | 0.75 | (0.75, 0.75) |
| 100% | 1.00 | 1.00 | (1.00, 1.00) |
This is the fastest visual memory hook: diagonal = random, top-left = perfect, below diagonal = probably backwards.
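You can confirm the diagonal with a quick simulation: assign scores at random and count correctly ordered pairs. A sketch using only the standard library, with the running example's 200/800 class balance:

```python
import random

random.seed(0)
labels = [1] * 200 + [0] * 800               # 200 positives, 800 negatives
scores = [random.random() for _ in labels]   # scores unrelated to the labels

positives = [s for s, l in zip(scores, labels) if l == 1]
negatives = [s for s, l in zip(scores, labels) if l == 0]

correct = sum(1 for p in positives for n in negatives if p > n)
print(correct / (len(positives) * len(negatives)))   # ≈ 0.5 for random scores
```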
10. ROC-AUC vs PR-AUC on imbalanced data
For imbalanced datasets, ROC-AUC can sometimes look optimistic because the FPR denominator contains all negatives. In our example there are 800 negatives, so 30 false positives gives:
FPR = 30 / 800 = 0.0375
That looks tiny. But those same 30 false positives matter a lot for precision:
precision = 120 / (120 + 30) = 0.80
If we loosen the threshold so that false positives rise to 160 (and true positives to 170):
FPR = 160 / 800 = 0.20
precision = 170 / (170 + 160) = 0.52
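The same arithmetic side by side makes the contrast clear:

```python
# (tp, fp) at the middle and looser thresholds from the sweep above.
for name, tp, fp in [("middle", 120, 30), ("loose", 170, 160)]:
    fpr = fp / 800                # all 800 negatives in the denominator
    precision = tp / (tp + fp)    # only flagged examples in the denominator
    print(f"{name}: FPR={fpr:.3f}  precision={precision:.2f}")
# middle: FPR=0.038  precision=0.80
# loose:  FPR=0.200  precision=0.52
```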
11. Choosing the right metric
| Goal | Metric to watch | Why |
|---|---|---|
| Minimize false alarms | Precision | Measures how trustworthy positive predictions are. |
| Catch as many positives as possible | Recall | Measures how many actual positives you found. |
| Balance precision and recall | F1 | Single-number summary when both errors matter. |
| Evaluate ranking quality independent of threshold | ROC-AUC | Measures whether positives tend to score above negatives. |
| Evaluate positive-class retrieval on imbalanced data | Precision-recall curve / PR-AUC | Focuses directly on the quality and coverage of positive predictions. |
| Deployment decision | Confusion matrix at chosen threshold | You need actual TP/FP/FN/TN counts to reason about cost. |
12. The cheat sheet
| Metric | Quick reminder |
|---|---|
| Accuracy | Can be misleading with imbalance. |
| Precision | “When I predict positive, am I right?” |
| Recall / TPR | “Of real positives, how many did I catch?” |
| FPR | “Of real negatives, how many did I falsely flag?” |
| Specificity / TNR | “Of real negatives, how many did I correctly ignore?” |
| F1 | Harmonic mean of precision and recall. |
| ROC curve | Shows tradeoff over all thresholds. |
| AUC | Same as the fraction of positive-negative pairs ranked correctly. |
13. A practical workflow
- Start with the base rate: here, positives are only 20%.
- Train a classifier that emits scores or probabilities.
- Look at ROC-AUC to understand ranking quality.
- Look at the precision-recall curve because the dataset is imbalanced.
- Pick a threshold based on the real cost of FP vs FN.
- Report the confusion matrix, precision, recall, and F1 at that threshold.
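Put together on a real dataset, the workflow might look roughly like this. A minimal sketch, assuming scikit-learn is available; the synthetic dataset, model, and 0.5 threshold are placeholders, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 20% positives, like the running example.
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # a classifier that emits scores

print("ROC-AUC:", roc_auc_score(y_test, scores))  # ranking quality
precision, recall, thresholds = precision_recall_curve(y_test, scores)  # PR view for imbalance

threshold = 0.5                                   # choose from the real FP vs FN costs
y_pred = (scores >= threshold).astype(int)
print(confusion_matrix(y_test, y_pred))           # report at the chosen threshold
print(classification_report(y_test, y_pred))
```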