Image Segmentation
How to label every pixel in an image — semantic, instance, and panoptic segmentation, plus the models that power them.
Labeling Every Pixel
🔄 Lesson 4 covered object detection — finding objects and drawing bounding boxes around them. But bounding boxes are imprecise: a rectangular box around a person includes background pixels. Many applications need exact boundaries — pixel by pixel.
Image segmentation assigns a label to every pixel in the image. It’s the most detailed form of visual understanding, and it powers applications from autonomous driving to medical imaging.
Three Types of Segmentation
Semantic Segmentation: Classify every pixel into a category — road, building, sky, car, person. All pixels of the same class get the same label. Two adjacent cars are labeled identically as “car” — you can’t tell them apart.
Instance Segmentation: Separate individual objects of the same class. Two adjacent cars get different labels: “car-1” and “car-2.” Each object gets its own pixel mask. But background categories (road, sky) aren’t labeled.
Panoptic Segmentation: The combination — every pixel gets a semantic label, and “thing” objects (cars, people) also get individual instance IDs. Complete scene understanding.
| Approach | Labels Background? | Separates Instances? | Use Case |
|---|---|---|---|
| Semantic | Yes | No | Land use mapping, road scene understanding |
| Instance | No | Yes | Counting objects, tracking individuals |
| Panoptic | Yes | Yes | Autonomous driving, complete scene parsing |
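The three labeling schemes can be made concrete with small label maps. This is an illustrative sketch: the toy scene, class IDs, and the `class_id * 1000 + instance_id` packing (the scheme Cityscapes uses for its instance ID maps) are assumptions for demonstration, not output from any model.

```python
import numpy as np

# Toy 4x6 scene: two cars on a road. Class IDs are illustrative.
ROAD, CAR = 0, 1

# Semantic map: every pixel gets a class ID; the two cars are indistinguishable.
semantic = np.array([
    [ROAD, ROAD, ROAD, ROAD, ROAD, ROAD],
    [ROAD, CAR,  CAR,  ROAD, CAR,  CAR ],
    [ROAD, CAR,  CAR,  ROAD, CAR,  CAR ],
    [ROAD, ROAD, ROAD, ROAD, ROAD, ROAD],
])

# Instance map: each "thing" gets its own ID; background "stuff" stays 0.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Panoptic map: pack (class, instance) into one integer per pixel,
# Cityscapes-style: panoptic_id = class_id * 1000 + instance_id.
panoptic = semantic * 1000 + instance

print(np.unique(semantic))   # [0 1]            — road, car
print(np.unique(instance))   # [0 1 2]          — background, car-1, car-2
print(np.unique(panoptic))   # [   0 1001 1002] — road, car-1, car-2
```

Note how the panoptic map alone answers both questions: the thousands digit recovers the class, the remainder recovers the instance.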
✅ Quick Check: A factory camera monitors an assembly line. Three bolts sit on a conveyor belt, and the system needs to check if each bolt is correctly positioned. Which segmentation type? Instance segmentation — you need to identify each bolt as a separate object to check its individual position. Semantic segmentation would label all three as “bolt” with no way to evaluate each one independently.
Semantic Segmentation Models
FCN (Fully Convolutional Network, 2015): The first deep learning approach to semantic segmentation. Replaced classification layers with convolutional layers that output a label map the same size as the input. Simple but produced coarse boundaries.
U-Net (2015): Designed for medical imaging where precise boundaries are critical. Uses an encoder-decoder architecture with skip connections — the encoder captures what’s in the image (context), and the decoder uses skip connections from the encoder to precisely locate boundaries. U-Net remains the standard for medical image segmentation.
DeepLab (2015-2018): Introduced atrous (dilated) convolutions that expand the receptive field without losing resolution. DeepLabV3+ combines this with an encoder-decoder structure for sharp, accurate boundaries. Widely used in autonomous driving and satellite imagery.
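The atrous idea is easiest to see on an impulse in one dimension. The sketch below, a simplified 1-D stand-in for DeepLab's 2-D convolutions, stacks three 3-tap dilated convolutions (dilations 1, 2, 4) and counts how many input positions reach the output, showing the receptive field growing without any downsampling:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1-D convolution with gaps of (dilation - 1) positions
    between kernel taps — the atrous trick: wider context, same resolution."""
    k = len(kernel)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * xp[i + j * dilation]
    return out

# An impulse input reveals the receptive field of the stacked layers.
x = np.zeros(31)
x[15] = 1.0
ones = np.ones(3)

y = x
for d in (1, 2, 4):              # doubling dilations, DeepLab-style
    y = dilated_conv1d(y, ones, d)

receptive_field = np.count_nonzero(y)
print(receptive_field)           # 15 input taps reached by three 3-tap layers
print(len(y) == len(x))          # True — resolution preserved, no downsampling
```

Three ordinary (dilation-1) 3-tap layers would only reach 7 positions; doubling the dilation each layer grows the receptive field exponentially while the output stays the same length as the input.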
Instance Segmentation Models
Mask R-CNN (2017): Extends Faster R-CNN (object detection) with an additional branch that predicts a pixel-level mask for each detected object. Three outputs per detection: class, bounding box, and mask. Still widely used in production.
How Mask R-CNN works:
- Extract features from the image (CNN backbone)
- Propose regions (Region Proposal Network)
- For each region: predict class, refine bounding box, AND generate a binary pixel mask
- The mask branch adds minimal overhead to the existing detection pipeline
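The three per-detection outputs can be sketched with toy numbers. Everything below is fabricated for illustration — in Mask R-CNN these values come from small head networks on RoIAligned features — but the box-delta parameterization (relative shifts, log-scale sizes) and the sigmoid-at-0.5 mask threshold match the standard R-CNN recipe:

```python
import numpy as np

def apply_box_deltas(box, deltas):
    """Refine a proposal box with predicted (dx, dy, dw, dh) deltas using the
    standard R-CNN parameterization: center shifts relative to box size,
    width/height changes in log scale."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * np.exp(dw), h * np.exp(dh)
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])

# Fabricated head outputs for one proposed region.
class_logits = np.array([0.2, 2.5, -1.0])            # e.g. [background, car, person]
box_deltas = np.array([0.0, 0.0, np.log(2.0), 0.0])  # double the width
mask_logits = np.full((28, 28), -1.0)                # 28x28 grid, as in the paper
mask_logits[8:20, 8:20] = 3.0                        # high logits where the object is

pred_class = int(np.argmax(class_logits))                # 1 -> "car"
refined_box = apply_box_deltas([10, 10, 30, 30], box_deltas)
binary_mask = 1 / (1 + np.exp(-mask_logits)) > 0.5       # sigmoid, threshold 0.5

print(pred_class)         # 1
print(refined_box)        # [ 0. 10. 40. 30.] — width doubled around the center
print(binary_mask.sum())  # 144 pixels in the 12x12 mask region
```

The 28×28 mask is predicted in the region's own coordinate frame and then resized to the refined box, which is why the mask branch adds so little cost on top of detection.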
SAM: The Foundation Model
Segment Anything Model (SAM, Meta 2023): A foundation model for segmentation that can segment any object in any image without task-specific training. Trained on 11 million images with 1.1 billion masks.
SAM accepts several types of prompts:
- Point: Click on an object → SAM segments it
- Box: Draw a bounding box → SAM segments the object inside
- Mask: Provide a rough mask → SAM refines it
Text prompts ("segment the dog") were explored in the SAM paper but are not part of the publicly released model.
SAM represents a paradigm shift — from training specialized segmentation models for each task to using a general-purpose model that handles novel objects zero-shot.
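To make the point-prompt interface concrete without downloading SAM, here is a deliberately naive stand-in: region growing from a "clicked" pixel. Only the interface (seed point in, binary mask out) mimics SAM — the real model uses a ViT image encoder and a learned mask decoder, not intensity thresholds, and the `tol` parameter is an assumption of this toy:

```python
import numpy as np
from collections import deque

def segment_from_point(image, seed, tol=10):
    """Toy point-prompted segmentation: grow a mask outward from the seed,
    adding 4-connected neighbours whose intensity is within `tol` of the
    seed pixel. Interface-only mimic of SAM's point prompt."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    seed_val = int(image[seed])
    q = deque([seed])
    while q:
        r, c = q.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w and not mask[nr, nc]
                    and abs(int(image[nr, nc]) - seed_val) <= tol):
                mask[nr, nc] = True
                q.append((nr, nc))
    return mask

# A bright square on a dark background; "click" inside the square.
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 200

mask = segment_from_point(img, seed=(3, 3))
print(mask.sum())   # 16 — exactly the 4x4 bright square
```

The contrast with the real thing is the lesson: this toy breaks on any textured object, while SAM's learned decoder generalizes the same one-click interface to arbitrary images.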
✅ Quick Check: When would you choose Mask R-CNN over SAM? When you need consistent, repeatable segmentation of specific known categories in a production pipeline — like segmenting tumors in medical scans or defects on an assembly line. SAM is flexible but less precise for specialized tasks. Mask R-CNN, fine-tuned on your specific data, produces more reliable results for known categories. SAM excels at interactive segmentation, annotation assistance, and tasks with novel object types.
Evaluation Metrics
IoU (Intersection over Union): The same metric as in detection, applied per pixel — the number of pixels where the predicted and ground-truth masks overlap, divided by the number of pixels in their union.
mIoU (mean IoU): Average IoU across all classes. The standard metric for semantic segmentation benchmarks.
AP (Average Precision): For instance segmentation — measures both detection accuracy and mask quality. Higher IoU thresholds demand more precise masks.
| Metric | What It Measures | Typical Good Score |
|---|---|---|
| mIoU | Semantic segmentation quality | 70-85% (dataset-dependent) |
| AP50 | Instance detection at IoU ≥ 0.5 | 50-65% (COCO) |
| AP75 | Instance detection at IoU ≥ 0.75 | 35-50% (COCO) |
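Per-class IoU and mIoU are short enough to compute by hand. A minimal sketch on a toy 4×4 prediction (the label maps and class IDs are made up for illustration):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Pixel-wise IoU for each class: |pred ∩ gt| / |pred ∪ gt|.
    Classes absent from both maps get NaN and are skipped in the mean."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else float("nan"))
    return np.array(ious)

# Toy maps with classes {0: background, 1: object}; the prediction
# spills one extra column over the object's right edge.
gt = np.array([[0, 0, 0, 0],
               [0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0]])
pred = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 1],
                 [0, 1, 1, 1],
                 [0, 0, 0, 0]])

ious = per_class_iou(pred, gt, num_classes=2)
miou = float(np.nanmean(ious))
print(ious)   # [0.8333... 0.6666...] — background, object
print(miou)   # 0.75
```

Averaging per class (rather than over all pixels) is what makes mIoU sensitive to small classes: a model that nails the road but misses every pedestrian scores poorly even though most pixels are correct.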
Key Takeaways
- Semantic segmentation labels every pixel by class but can’t separate instances — road, sky, building
- Instance segmentation separates individual objects but skips background — car-1, car-2, person-1
- Panoptic segmentation combines both — complete scene understanding
- U-Net (encoder-decoder + skip connections) dominates medical imaging segmentation
- Mask R-CNN extends object detection with per-pixel masks — the production standard for instances
- SAM segments anything zero-shot — foundation model trained on 1.1 billion masks
Up Next
Training segmentation models from scratch requires massive labeled datasets — each pixel manually labeled. Lesson 6 covers how transfer learning and data augmentation let you build accurate CV models with a fraction of the data.