Object Detection
How YOLO, Faster R-CNN, and modern detectors find and locate objects in images — two-stage vs one-stage approaches, bounding boxes, and real-time performance.
Beyond “What” to “Where”
🔄 Lesson 3 covered CNNs — networks that classify entire images by detecting hierarchical patterns. But classification answers only “what is in this image?” Many applications need more: “what objects are present, and where exactly are they?”
Object detection outputs a list of bounding boxes — each with a class label (“car,” “person,” “dog”) and a confidence score. A single image might contain zero objects or dozens.
Two Approaches to Detection
Object detection evolved along two parallel tracks: two-stage detectors (accurate but slower) and one-stage detectors (faster but historically less accurate). The gap between them has narrowed significantly.
Two-Stage Detectors: R-CNN Family
The idea: First propose regions that might contain objects, then classify each region.
R-CNN (2014): Extract ~2,000 region proposals using selective search. Warp each region to a fixed size. Feed each one through a CNN for classification. Painfully slow — 47 seconds per image.
Fast R-CNN (2015): Run the CNN once on the entire image to produce a feature map. Extract features from proposed regions using this shared feature map. 25× faster than R-CNN.
Faster R-CNN (2015): Replace the slow selective search with a Region Proposal Network (RPN) — a small neural network that proposes regions from the feature map itself. End-to-end trainable. 10× faster than Fast R-CNN.
The R-CNN family achieves the highest accuracy on benchmarks but processes images at 5-15 FPS — too slow for real-time applications.
✅ Quick Check: Why is running the CNN once on the whole image (Fast R-CNN) so much faster than running it separately on each region (R-CNN)? Because the expensive part is the CNN forward pass. R-CNN runs 2,000 separate forward passes — one per region proposal. Fast R-CNN runs one forward pass on the full image, then cheaply extracts features from the shared feature map for each proposed region. The feature computation is shared across all regions, eliminating 99.95% of the redundant work.
One-Stage Detectors: YOLO
The idea: Skip region proposals entirely. Process the entire image in a single pass and predict all bounding boxes and classes simultaneously.
YOLO (You Only Look Once, 2015): Divide the image into a grid (e.g., 7×7). Each grid cell predicts bounding boxes and class probabilities. One forward pass → all detections. The original ran at 45 FPS — real-time detection was born.
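The grid idea can be made concrete with a small sketch. This follows the original YOLOv1 setup (7×7 grid, 448×448 input): a cell predicts a box center as an offset within that cell, and width/height relative to the whole image. The function name and example values here are illustrative, not real model output.

```python
# Sketch of YOLOv1-style box decoding. Each grid cell predicts a box
# center (cx, cy) as a 0-1 offset inside the cell, plus width/height
# as fractions of the full image.

def decode_box(row, col, cx, cy, w, h, grid=7, img_size=448):
    """Convert one grid cell's normalized prediction to pixel coordinates."""
    cell = img_size / grid               # pixels per grid cell (448/7 = 64)
    center_x = (col + cx) * cell         # cell-relative offset -> image x
    center_y = (row + cy) * cell
    box_w = w * img_size                 # width/height are image-relative
    box_h = h * img_size
    # Return as (x, y, width, height) with (x, y) the top-left corner
    return (center_x - box_w / 2, center_y - box_h / 2, box_w, box_h)

# A box centered in cell (row=3, col=2), covering a quarter of each image axis:
x, y, bw, bh = decode_box(row=3, col=2, cx=0.5, cy=0.5, w=0.25, h=0.25)
print(x, y, bw, bh)  # 104.0 168.0 112.0 112.0
```

Because every cell makes its predictions in parallel from one forward pass, all boxes for the whole image come out at once.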
The YOLO evolution:
| Version | Year | Key Innovation | Speed (T4 GPU) |
|---|---|---|---|
| YOLOv1 | 2015 | Single-pass detection | ~45 FPS |
| YOLOv3 | 2018 | Multi-scale predictions | ~35 FPS |
| YOLOv5 | 2020 | PyTorch native, easy deployment | ~140 FPS |
| YOLOv8 | 2023 | Anchor-free design | ~160 FPS |
| YOLOv10 | 2024 | No NMS needed (dual assignment) | ~180 FPS |
| YOLOv11 | 2024 | Transformer-enhanced backbone | ~2.4 ms/image (~417 FPS) |
| YOLOv12 | 2025 | Area Attention module | ~1.64 ms/image (~610 FPS) |
Important: Newer YOLO versions aren’t always better. Performance depends on the specific task and domain. Ultralytics (the YOLO maintainers) recommends YOLOv11 for most production workloads — it balances accuracy, speed, and stability. YOLOv12’s attention mechanism increases memory usage, which may not be worth it for every application.
Bounding Boxes and IoU
A detection is defined by:
- Bounding box: (x, y, width, height) — the rectangle around the detected object
- Class label: What the object is
- Confidence score: How confident the model is (0 to 1)
IoU (Intersection over Union) measures how well a predicted box matches the ground truth:
- IoU = (area of overlap) / (area of union)
- IoU > 0.5 is typically considered a correct detection
- IoU > 0.75 is considered a good detection
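The IoU formula translates directly into a few lines of code. This is a minimal sketch assuming boxes in the (x, y, width, height) format described above; the example coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x, y, width, height)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah  # bottom-right corners
    bx2, by2 = bx1 + bw, by1 + bh
    # Overlap rectangle; clamp to zero when the boxes don't intersect
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes offset by 5 pixels: overlap = 50, union = 150, IoU = 1/3
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # 0.333...
```

Note that even a visually decent-looking overlap like this one (half the box width) scores only 0.33, below the 0.5 "correct detection" bar.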
Non-Maximum Suppression (NMS)
A detector often produces multiple overlapping boxes for the same object. NMS cleans this up:
- Sort detections by confidence
- Take the highest-confidence box
- Remove all other boxes that overlap significantly (IoU > threshold)
- Repeat for the next highest-confidence remaining box
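The four steps above can be sketched as a greedy loop. This is a simplified single-class version; detection values and box format ((x, y, width, height) plus confidence) are illustrative assumptions.

```python
def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression. Each detection is ((x, y, w, h), conf)."""
    def iou(a, b):
        # Intersection over Union for (x, y, width, height) boxes
        iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union > 0 else 0.0

    # Step 1: sort detections by confidence, highest first
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        # Step 2: keep the highest-confidence box
        best = remaining.pop(0)
        kept.append(best)
        # Step 3: suppress boxes that overlap it too much; step 4: repeat
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept

# Five heavily overlapping boxes on one object -> only the 0.95 box survives
boxes = [((100 + i, 100, 50, 50), c)
         for i, c in enumerate([0.95, 0.88, 0.82, 0.79, 0.71])]
print(len(nms(boxes)))  # 1
```

In practice NMS is run per class, so a dog box never suppresses an overlapping person box.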
YOLOv10 eliminated NMS entirely through a dual-assignment training strategy — one head produces one box per object during inference, removing this post-processing step.
✅ Quick Check: A car in an image triggers 5 overlapping detection boxes with confidences [0.95, 0.88, 0.82, 0.79, 0.71]. After NMS with IoU threshold 0.5, how many boxes remain? One — the box with confidence 0.95. NMS keeps the highest-confidence box and suppresses all overlapping boxes (IoU > 0.5). Since all 5 boxes overlap heavily (they all detected the same car), only the best one survives. This is how you go from hundreds of raw predictions to a clean set of final detections.
The Hard Problems
Small objects: Objects under 32×32 pixels are notoriously hard to detect. Feature Pyramid Networks (FPN) help by detecting at multiple resolutions — low-resolution feature maps find large objects, high-resolution maps find small ones.
Occlusion: When one object hides behind another, the visible portion may not contain enough visual information for reliable detection. This remains an open research problem.
Class confusion: Objects from similar classes (dog vs wolf, cup vs mug) produce detection errors. Fine-grained detection requires more labeled data and specialized training.
Key Takeaways
- Object detection finds objects and locates them with bounding boxes — class label + position + confidence
- Two-stage (Faster R-CNN): highest accuracy, 5-15 FPS — best for offline/batch processing
- One-stage (YOLO): real-time speed (30-400+ FPS) with competitive accuracy — best for live applications
- YOLO evolved from v1 (2015) to v12 (2025) — but newer isn’t always better; v11 recommended for production
- IoU measures detection quality: > 0.5 is correct, > 0.75 is good
- Hard problems: small objects, occlusion, and similar-class confusion remain challenging
Up Next
Object detection draws boxes around things. But what if you need to know the exact shape of each object — pixel by pixel? Lesson 5 covers image segmentation: semantic, instance, and panoptic approaches.