Object Detection
How YOLO, Faster R-CNN, and modern detectors find and locate objects in images — two-stage vs one-stage approaches, bounding boxes, and real-time performance.
Beyond “What” to “Where”
🔄 Lesson 3 covered CNNs — networks that classify entire images by detecting hierarchical patterns. But classification answers only “what is in this image?” Many applications need more: “what objects are present, and where exactly are they?”
Object detection outputs a list of bounding boxes — each with a class label (“car,” “person,” “dog”) and a confidence score. A single image might contain zero objects or dozens.
Two Approaches to Detection
Object detection evolved along two parallel tracks: two-stage detectors (accurate but slower) and one-stage detectors (faster but historically less accurate). The gap between them has narrowed significantly.
Two-Stage Detectors: R-CNN Family
The idea: First propose regions that might contain objects, then classify each region.
R-CNN (2014): Extract ~2,000 region proposals using selective search. Warp each region to a fixed size. Feed each one through a CNN for classification. Painfully slow — 47 seconds per image.
Fast R-CNN (2015): Run the CNN once on the entire image to produce a feature map. Extract features from proposed regions using this shared feature map. 25× faster than R-CNN.
Faster R-CNN (2015): Replace the slow selective search with a Region Proposal Network (RPN) — a small neural network that proposes regions from the feature map itself. End-to-end trainable. 10× faster than Fast R-CNN.
The R-CNN family achieves the highest accuracy on benchmarks but processes images at 5-15 FPS — too slow for real-time applications.
✅ Quick Check: Why is running the CNN once on the whole image (Fast R-CNN) so much faster than running it separately on each region (R-CNN)? Because the expensive part is the CNN forward pass. R-CNN runs 2,000 separate forward passes — one per region proposal. Fast R-CNN runs one forward pass on the full image, then cheaply extracts features from the shared feature map for each proposed region. The feature computation is shared across all regions, eliminating 99.95% of the redundant work.
One-Stage Detectors: YOLO
The idea: Skip region proposals entirely. Process the entire image in a single pass and predict all bounding boxes and classes simultaneously.
YOLO (You Only Look Once, 2015): Divide the image into a grid (e.g., 7×7). Each grid cell predicts bounding boxes and class probabilities. One forward pass → all detections. The original ran at 45 FPS — real-time detection was born.
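The grid idea can be made concrete with a small sketch. This follows the original YOLOv1 setup (7×7 grid, 448×448 input): a cell predicts a box center as an offset within that cell, and width/height relative to the whole image. The function name and example values here are illustrative, not real model output.

```python
# Sketch of YOLOv1-style box decoding. Each grid cell predicts a box
# center (cx, cy) as a 0-1 offset inside the cell, plus width/height
# as fractions of the full image.

def decode_box(row, col, cx, cy, w, h, grid=7, img_size=448):
    """Convert one grid cell's normalized prediction to pixel coordinates."""
    cell = img_size / grid               # pixels per grid cell (448/7 = 64)
    center_x = (col + cx) * cell         # cell-relative offset -> image x
    center_y = (row + cy) * cell
    box_w = w * img_size                 # width/height are image-relative
    box_h = h * img_size
    # Return as (x, y, width, height) with (x, y) the top-left corner
    return (center_x - box_w / 2, center_y - box_h / 2, box_w, box_h)

# A box centered in cell (row=3, col=2), covering a quarter of each image axis:
x, y, bw, bh = decode_box(row=3, col=2, cx=0.5, cy=0.5, w=0.25, h=0.25)
print(x, y, bw, bh)  # 104.0 168.0 112.0 112.0
```

Because every cell makes its predictions in parallel from one forward pass, all boxes for the whole image come out at once.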
The YOLO evolution:
| Version | Year | Key Innovation | Speed (T4 GPU) |
|---|---|---|---|
| YOLOv1 | 2015 | Single-pass detection | ~45 FPS |
| YOLOv3 | 2018 | Multi-scale predictions | ~35 FPS |
| YOLOv5 | 2020 | PyTorch native, easy deployment | ~140 FPS |
| YOLOv8 | 2023 | Anchor-free design | ~160 FPS |
| YOLOv10 | 2024 | No NMS needed (dual assignment) | ~180 FPS |
| YOLOv11 | 2024 | Transformer-enhanced backbone | ~2.4 ms/image (~417 FPS) |
| YOLOv12 | 2025 | Area Attention module | ~1.64 ms/image (~610 FPS) |
Important: Newer YOLO versions aren’t always better. Performance depends on the specific task and domain. Ultralytics (the YOLO maintainers) recommends YOLOv11 for most production workloads — it balances accuracy, speed, and stability. YOLOv12’s attention mechanism increases memory usage, which may not be worth it for every application.
Bounding Boxes and IoU
A detection is defined by:
- Bounding box: (x, y, width, height) — the rectangle around the detected object
- Class label: What the object is
- Confidence score: How confident the model is (0 to 1)
IoU (Intersection over Union) measures how well a predicted box matches the ground truth:
- IoU = (area of overlap) / (area of union)
- IoU > 0.5 is typically considered a correct detection
- IoU > 0.75 is considered a good detection
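The IoU formula translates directly into a few lines of code. This is a minimal sketch assuming boxes in the (x, y, width, height) format described above; the example coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x, y, width, height)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah  # bottom-right corners
    bx2, by2 = bx1 + bw, by1 + bh
    # Overlap rectangle; clamp to zero when the boxes don't intersect
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes offset by 5 pixels: overlap = 50, union = 150, IoU = 1/3
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # 0.333...
```

Note that even a visually decent-looking overlap like this one (half the box width) scores only 0.33, below the 0.5 "correct detection" bar.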
Non-Maximum Suppression (NMS)
A detector often produces multiple overlapping boxes for the same object. NMS cleans this up:
- Sort detections by confidence
- Take the highest-confidence box
- Remove all other boxes that overlap significantly (IoU > threshold)
- Repeat for the next highest-confidence remaining box
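The four steps above can be sketched as a greedy loop. This is a simplified single-class version; detection values and box format ((x, y, width, height) plus confidence) are illustrative assumptions.

```python
def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression. Each detection is ((x, y, w, h), conf)."""
    def iou(a, b):
        # Intersection over Union for (x, y, width, height) boxes
        iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    # Step 1: sort detections by confidence, highest first
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        # Step 2: keep the highest-confidence box
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: suppress boxes that overlap it too much; step 4: repeat
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept

# Five heavily overlapping boxes on one object -> only the 0.95 box survives
boxes = [((100 + i, 100, 50, 50), c)
         for i, c in enumerate([0.95, 0.88, 0.82, 0.79, 0.71])]
print(len(nms(boxes)))  # 1
```

In practice NMS is run per class, so a dog box never suppresses an overlapping person box.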
YOLOv10 eliminated NMS entirely through a dual-assignment training strategy — one head produces one box per object during inference, removing this post-processing step.
✅ Quick Check: A car in an image triggers 5 overlapping detection boxes with confidences [0.95, 0.88, 0.82, 0.79, 0.71]. After NMS with IoU threshold 0.5, how many boxes remain? One — the box with confidence 0.95. NMS keeps the highest-confidence box and suppresses all overlapping boxes (IoU > 0.5). Since all 5 boxes overlap heavily (they all detected the same car), only the best one survives. This is how you go from hundreds of raw predictions to a clean set of final detections.
The Hard Problems
Small objects: Objects under 32×32 pixels are notoriously hard to detect. Feature Pyramid Networks (FPN) help by detecting at multiple resolutions — low-resolution feature maps find large objects, high-resolution maps find small ones.
Occlusion: When one object hides behind another, the visible portion may not contain enough visual information for reliable detection. This remains an open research problem.
Class confusion: Objects from similar classes (dog vs wolf, cup vs mug) produce detection errors. Fine-grained detection requires more labeled data and specialized training.
Key Takeaways
- Object detection finds objects and locates them with bounding boxes — class label + position + confidence
- Two-stage (Faster R-CNN): highest accuracy, 5-15 FPS — best for offline/batch processing
- One-stage (YOLO): real-time speed (30-400+ FPS) with competitive accuracy — best for live applications
- YOLO evolved from v1 (2015) to v12 (2025) — but newer isn’t always better; v11 recommended for production
- IoU measures detection quality: > 0.5 is correct, > 0.75 is good
- Hard problems: small objects, occlusion, and similar-class confusion remain challenging
Up Next
Object detection draws boxes around things. But what if you need to know the exact shape of each object — pixel by pixel? Lesson 5 covers image segmentation: semantic, instance, and panoptic approaches.