0. The running example
Assume we have 1,000 examples: 200 are actually positive and 800 are actually negative, so positives make up only 20% of the data.
1. Confusion matrix: the source of most metrics
Suppose our classifier predicts positive for 150 examples. Out of those, 120 are truly positive and 30 are actually negative.
| | Actual positive | Actual negative |
|---|---|---|
| Predicted positive | TP = 120 | FP = 30 |
| Predicted negative | FN = 80 | TN = 770 |
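All four cells come straight from comparing predictions with labels. A minimal sketch in Python, assuming hypothetical 0/1 lists `y_true` and `y_pred` rather than the article's actual data:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN from parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# For the running example the counts would come out as:
# tp, fp, fn, tn = 120, 30, 80, 770   (sums to 1,000 examples)
```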
2. Precision: “When I say positive, how often am I right?”
precision = TP / (TP + FP)
In our example:
precision = 120 / (120 + 30) = 120 / 150 = 0.80
High precision means few false positives. It is important when false alarms are expensive.
Examples: fraud investigation queues, medical follow-up tests, moderation actions, expensive manual review.
3. Recall: “Of all real positives, how many did I catch?”
recall = TP / (TP + FN)
In our example:
recall = 120 / (120 + 80) = 120 / 200 = 0.60
High recall means few false negatives. It is important when missing a positive is expensive.
Examples: cancer screening, safety violations, security threats, rare but important events.
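Both precision and recall are one line of arithmetic once the four counts are known. A quick check of the running example in plain Python:

```python
tp, fp, fn, tn = 120, 30, 80, 770

precision = tp / (tp + fp)   # 120 / 150
recall = tp / (tp + fn)      # 120 / 200

print(f"precision = {precision:.2f}")  # 0.80
print(f"recall    = {recall:.2f}")     # 0.60
```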
4. Precision and recall pull against each other
Most classifiers output a score, not just a hard label. You choose a threshold. Lowering the threshold predicts positive more often; raising the threshold predicts positive less often.
Lower threshold
Predict more positives.
- Usually higher recall: you catch more true positives.
- Usually lower precision: you also catch more false positives.
Higher threshold
Predict fewer positives.
- Usually higher precision: flagged examples are more trustworthy.
- Usually lower recall: you miss more true positives.
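The threshold itself is nothing more than a comparison against the score. A minimal sketch, with hypothetical scores chosen only to illustrate the effect:

```python
def predict_at_threshold(scores, threshold):
    """Turn raw model scores into hard 0/1 predictions."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.95, 0.70, 0.40, 0.15]          # hypothetical model scores
print(predict_at_threshold(scores, 0.8))   # [1, 0, 0, 0]  strict: fewer positives
print(predict_at_threshold(scores, 0.3))   # [1, 1, 1, 0]  loose: more positives
```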
5. F1-score: one number for the precision/recall balance
F1 = 2 × precision × recall / (precision + recall)
Using precision = 0.80 and recall = 0.60:
F1 = 2 × 0.80 × 0.60 / (0.80 + 0.60) = 0.686
| Precision | Recall | F1 | Interpretation |
|---|---|---|---|
| 0.90 | 0.90 | 0.90 | Both strong |
| 1.00 | 0.10 | 0.18 | Very selective, misses most positives |
| 0.25 | 1.00 | 0.40 | Catches everything, many false alarms |
| 0.80 | 0.60 | 0.69 | Our running example |
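Because F1 is a harmonic mean, it is dragged down by whichever of the two numbers is smaller, which is why the lopsided rows score so poorly. Reproducing the table values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for p, r in [(0.90, 0.90), (1.00, 0.10), (0.25, 1.00), (0.80, 0.60)]:
    print(f"precision={p:.2f}  recall={r:.2f}  F1={f1_score(p, r):.2f}")
# F1 = 0.90, 0.18, 0.40, 0.69
```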
6. Specificity and false positive rate
ROC curves use recall, but they call it true positive rate.
TPR = recall = TP / (TP + FN)
They also use false positive rate:
FPR = FP / (FP + TN)
In our example:
FPR = 30 / (30 + 770) = 30 / 800 = 0.0375
Specificity is the true negative rate:
specificity = TN / (TN + FP) = 1 - FPR
Recall / TPR asks: among actual positives, how many did we catch?
FPR asks: among actual negatives, how many did we incorrectly flag?
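All of these rates come from the same four counts. Checking the running example:

```python
tp, fp, fn, tn = 120, 30, 80, 770

tpr = tp / (tp + fn)          # recall: 120 / 200 = 0.60
fpr = fp / (fp + tn)          # 30 / 800 = 0.0375
specificity = tn / (tn + fp)  # 770 / 800 = 0.9625

assert abs(specificity - (1 - fpr)) < 1e-12   # specificity = 1 - FPR
```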
7. Threshold sweep example
Imagine the same model with different thresholds. We are not retraining the model. We are only changing how confident it must be before saying “positive.”
| Threshold style | TP | FP | FN | TN | Precision | Recall / TPR | FPR | F1 |
|---|---|---|---|---|---|---|---|---|
| Very strict | 50 | 5 | 150 | 795 | 0.91 | 0.25 | 0.006 | 0.39 |
| Strict | 100 | 20 | 100 | 780 | 0.83 | 0.50 | 0.025 | 0.63 |
| Middle | 120 | 30 | 80 | 770 | 0.80 | 0.60 | 0.038 | 0.69 |
| Loose | 170 | 160 | 30 | 640 | 0.52 | 0.85 | 0.200 | 0.64 |
| Very loose | 195 | 500 | 5 | 300 | 0.28 | 0.98 | 0.625 | 0.44 |
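Every derived column in the table is a function of the four raw counts, so the whole sweep can be regenerated with a few lines of Python:

```python
rows = {
    "Very strict": (50, 5, 150, 795),
    "Strict": (100, 20, 100, 780),
    "Middle": (120, 30, 80, 770),
    "Loose": (170, 160, 30, 640),
    "Very loose": (195, 500, 5, 300),
}

for name, (tp, fp, fn, tn) in rows.items():
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:12s} P={precision:.2f}  R={recall:.2f}  FPR={fpr:.3f}  F1={f1:.2f}")
```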
8. ROC curve: performance across all thresholds
A ROC curve plots FPR on the x-axis and TPR (recall) on the y-axis.
Using the threshold sweep above, the ROC points are approximately:
| Threshold style | FPR | TPR / Recall |
|---|---|---|
| Very strict | 0.006 | 0.25 |
| Strict | 0.025 | 0.50 |
| Middle | 0.038 | 0.60 |
| Loose | 0.200 | 0.85 |
| Very loose | 0.625 | 0.98 |
Each point is one threshold. Moving from left to right means using a looser threshold: more true positives, but also more false positives.
Good ROC behavior
The curve rises steeply toward the top-left. That means you can catch many positives while flagging relatively few negatives.
Bad ROC behavior
The curve hugs the diagonal. That means the score is not ranking positives ahead of negatives much better than random guessing.
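In practice you rarely compute ROC points by hand; libraries sweep every distinct score for you. A minimal sketch, assuming scikit-learn is available and using small hypothetical labels and scores rather than the article's 1,000-example dataset:

```python
from sklearn.metrics import roc_curve

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]                         # hypothetical labels
y_score = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2]  # hypothetical scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```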
9. AUC: “How good is the ranking?”
AUC is the area under the ROC curve. It ranges from 0 to 1, and it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.
Why does “area under the ROC curve” equal that probability?
The ROC curve can feel geometric, but AUC has a simpler ranking interpretation.
Imagine taking one positive and one negative at random. The classifier gives each one a score. There are three possibilities:
- The positive scores higher: the model ranked this pair correctly.
- The negative scores higher: the model ranked this pair backwards.
- The two scores are tied: usually counted as half-correct.
So another way to compute AUC is:
AUC = correctly ordered positive-negative pairs / all positive-negative pairs
With our 200 positives and 800 negatives, there are:
200 × 800 = 160,000 positive-negative pairs
If the model ranks the positive above the negative in 144,000 of those pairs, then:
AUC = 144,000 / 160,000 = 0.90
Here are 5 positives and 5 negatives sorted by model score from highest to lowest. Every positive-negative pair where the positive appears earlier in the list is a correctly ordered pair.
| Rank by score | Label | Negatives below this positive | Pairwise contribution |
|---|---|---|---|
| 1 | Positive | 5 | 5 correct pairs |
| 2 | Negative | — | — |
| 3 | Positive | 4 | 4 correct pairs |
| 4 | Positive | 4 | 4 correct pairs |
| 5 | Negative | — | — |
| 6 | Negative | — | — |
| 7 | Positive | 2 | 2 correct pairs |
| 8 | Negative | — | — |
| 9 | Positive | 1 | 1 correct pair |
| 10 | Negative | — | — |
There are 5 × 5 = 25 possible positive-negative pairs. This ranking gets 5 + 4 + 4 + 2 + 1 = 16 of them correct, so AUC = 16 / 25 = 0.64.
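The pairwise count is easy to automate. A small sketch that reproduces the 16 / 25 result, using arbitrary decreasing numbers as stand-ins for the model scores:

```python
from itertools import product

# Same ordering as the table: 1 = positive, 0 = negative, sorted by score.
labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]   # any strictly decreasing scores work

positives = [s for s, l in zip(scores, labels) if l == 1]
negatives = [s for s, l in zip(scores, labels) if l == 0]

correct = sum(1 for p, n in product(positives, negatives) if p > n)
total = len(positives) * len(negatives)
print(f"AUC = {correct} / {total} = {correct / total}")   # AUC = 16 / 25 = 0.64
```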
The more the ROC curve bows toward the top-left, the larger the area under it. The diagonal baseline has AUC = 0.5.
Why is the random baseline diagonal?
A random classifier gives scores that are unrelated to the true label. So if you take the top 10% of examples by score, you expect to get about 10% of the positives and about 10% of the negatives. If you take the top 40%, you expect to get about 40% of the positives and about 40% of the negatives.
That means:
TPR ≈ FPR
And the graph of y = x is a diagonal line.
| Fraction selected by random score | Expected TPR | Expected FPR | ROC point |
|---|---|---|---|
| 10% | 0.10 | 0.10 | (0.10, 0.10) |
| 25% | 0.25 | 0.25 | (0.25, 0.25) |
| 50% | 0.50 | 0.50 | (0.50, 0.50) |
| 75% | 0.75 | 0.75 | (0.75, 0.75) |
| 100% | 1.00 | 1.00 | (1.00, 1.00) |
This is the fastest visual memory hook: diagonal = random, top-left = perfect, below diagonal = probably backwards.
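You can confirm the diagonal with a quick simulation: assign scores at random and count correctly ordered pairs. A sketch using only the standard library, with the running example's 200/800 class balance:

```python
import random

random.seed(0)
labels = [1] * 200 + [0] * 800               # 200 positives, 800 negatives
scores = [random.random() for _ in labels]   # scores unrelated to the labels

positives = [s for s, l in zip(scores, labels) if l == 1]
negatives = [s for s, l in zip(scores, labels) if l == 0]

correct = sum(1 for p in positives for n in negatives if p > n)
print(correct / (len(positives) * len(negatives)))   # ≈ 0.5 for random scores
```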
10. ROC-AUC vs PR-AUC on imbalanced data
For imbalanced datasets, ROC-AUC can sometimes look optimistic because the FPR denominator contains all negatives. In our example there are 800 negatives, so 30 false positives gives:
FPR = 30 / 800 = 0.0375
That looks tiny. But those same 30 false positives matter a lot for precision:
precision = 120 / (120 + 30) = 0.80
If we loosen the threshold so that false positives rise to 160 (and true positives to 170):
FPR = 160 / 800 = 0.20
precision = 170 / (170 + 160) = 0.52
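The same arithmetic side by side makes the contrast clear:

```python
# (tp, fp) at the middle and looser thresholds from the sweep above.
for name, tp, fp in [("middle", 120, 30), ("loose", 170, 160)]:
    fpr = fp / 800                # all 800 negatives in the denominator
    precision = tp / (tp + fp)    # only flagged examples in the denominator
    print(f"{name}: FPR={fpr:.3f}  precision={precision:.2f}")
# middle: FPR=0.038  precision=0.80
# loose:  FPR=0.200  precision=0.52
```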
11. Choosing the right metric
| Goal | Metric to watch | Why |
|---|---|---|
| Minimize false alarms | Precision | Measures how trustworthy positive predictions are. |
| Catch as many positives as possible | Recall | Measures how many actual positives you found. |
| Balance precision and recall | F1 | Single-number summary when both errors matter. |
| Evaluate ranking quality independent of threshold | ROC-AUC | Measures whether positives tend to score above negatives. |
| Evaluate positive-class retrieval on imbalanced data | Precision-recall curve / PR-AUC | Focuses directly on the quality and coverage of positive predictions. |
| Deployment decision | Confusion matrix at chosen threshold | You need actual TP/FP/FN/TN counts to reason about cost. |
12. The cheat sheet
| Metric | Quick reminder |
|---|---|
| Accuracy | Can be misleading with imbalance. |
| Precision | “When I predict positive, am I right?” |
| Recall / TPR | “Of real positives, how many did I catch?” |
| FPR | “Of real negatives, how many did I falsely flag?” |
| Specificity / TNR | “Of real negatives, how many did I correctly ignore?” |
| F1 | Harmonic mean of precision and recall. |
| ROC curve | Shows tradeoff over all thresholds. |
| AUC | Same as the fraction of positive-negative pairs ranked correctly. |
13. A practical workflow
- Start with the base rate: here, positives are only 20%.
- Train a classifier that emits scores or probabilities.
- Look at ROC-AUC to understand ranking quality.
- Look at the precision-recall curve because the dataset is imbalanced.
- Pick a threshold based on the real cost of FP vs FN.
- Report the confusion matrix, precision, recall, and F1 at that threshold.
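Put together on a real dataset, the workflow might look roughly like this. A minimal sketch, assuming scikit-learn is available; the synthetic dataset, model, and 0.5 threshold are placeholders, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve, roc_auc_score)
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 20% positives, like the running example.
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # a classifier that emits scores

print("ROC-AUC:", roc_auc_score(y_test, scores))  # ranking quality
precision, recall, thresholds = precision_recall_curve(y_test, scores)  # PR view for imbalance

threshold = 0.5                                   # choose from the real FP vs FN costs
y_pred = (scores >= threshold).astype(int)
print(confusion_matrix(y_test, y_pred))           # report at the chosen threshold
print(classification_report(y_test, y_pred))
```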