Image Segmentation
How to label every pixel in an image — semantic, instance, and panoptic segmentation, plus the models that power them.
Labeling Every Pixel
🔄 Lesson 4 covered object detection — finding objects and drawing bounding boxes around them. But bounding boxes are imprecise: a rectangular box around a person includes background pixels. Many applications need exact boundaries — pixel by pixel.
Image segmentation assigns a label to every pixel in the image. It’s the most detailed form of visual understanding, and it powers applications from autonomous driving to medical imaging.
Three Types of Segmentation
Semantic Segmentation: Classify every pixel into a category — road, building, sky, car, person. All pixels of the same class get the same label. Two adjacent cars are labeled identically as “car” — you can’t tell them apart.
Instance Segmentation: Separate individual objects of the same class. Two adjacent cars get different labels: “car-1” and “car-2.” Each object gets its own pixel mask. But background categories (road, sky) aren’t labeled.
Panoptic Segmentation: The combination — every pixel gets a semantic label, and “thing” objects (cars, people) also get individual instance IDs. Complete scene understanding.
| Approach | Labels Background? | Separates Instances? | Use Case |
|---|---|---|---|
| Semantic | Yes | No | Land use mapping, road scene understanding |
| Instance | No | Yes | Counting objects, tracking individuals |
| Panoptic | Yes | Yes | Autonomous driving, complete scene parsing |
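The three labeling schemes can be made concrete with small label maps. This is an illustrative sketch: the toy scene, class IDs, and the `class_id * 1000 + instance_id` packing (the scheme Cityscapes uses for its instance ID maps) are assumptions for demonstration, not output from any model.

```python
import numpy as np

# Toy 4x6 scene: two cars on a road. Class IDs are illustrative.
ROAD, CAR = 0, 1

# Semantic map: every pixel gets a class ID; the two cars are indistinguishable.
semantic = np.array([
    [ROAD, ROAD, ROAD, ROAD, ROAD, ROAD],
    [ROAD, CAR,  CAR,  ROAD, CAR,  CAR ],
    [ROAD, CAR,  CAR,  ROAD, CAR,  CAR ],
    [ROAD, ROAD, ROAD, ROAD, ROAD, ROAD],
])

# Instance map: each "thing" gets its own ID; background "stuff" stays 0.
instance = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

# Panoptic map: pack (class, instance) into one integer per pixel,
# Cityscapes-style: panoptic_id = class_id * 1000 + instance_id.
panoptic = semantic * 1000 + instance

print(np.unique(semantic))   # [0 1]            — road, car
print(np.unique(instance))   # [0 1 2]          — background, car-1, car-2
print(np.unique(panoptic))   # [   0 1001 1002] — road, car-1, car-2
```

Note how the panoptic map alone answers both questions: the thousands digit recovers the class, the remainder recovers the instance.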
✅ Quick Check: A factory camera monitors an assembly line. Three bolts sit on a conveyor belt, and the system needs to check if each bolt is correctly positioned. Which segmentation type? Instance segmentation — you need to identify each bolt as a separate object to check its individual position. Semantic segmentation would label all three as “bolt” with no way to evaluate each one independently.
Semantic Segmentation Models
FCN (Fully Convolutional Network, 2015): The first deep learning approach to semantic segmentation. Replaced classification layers with convolutional layers that output a label map the same size as the input. Simple but produced coarse boundaries.
U-Net (2015): Designed for medical imaging where precise boundaries are critical. Uses an encoder-decoder architecture with skip connections — the encoder captures what’s in the image (context), and the decoder uses skip connections from the encoder to precisely locate boundaries. U-Net remains the standard for medical image segmentation.
DeepLab (2015-2018): Introduced atrous (dilated) convolutions that expand the receptive field without losing resolution. DeepLabV3+ combines this with an encoder-decoder structure for sharp, accurate boundaries. Widely used in autonomous driving and satellite imagery.
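The atrous idea is easiest to see on an impulse in one dimension. The sketch below, a simplified 1-D stand-in for DeepLab's 2-D convolutions, stacks three 3-tap dilated convolutions (dilations 1, 2, 4) and counts how many input positions reach the output, showing the receptive field growing without any downsampling:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1-D convolution with gaps of (dilation - 1) positions
    between kernel taps — the atrous trick: wider context, same resolution."""
    k = len(kernel)
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for i in range(len(x)):
        for j in range(k):
            out[i] += kernel[j] * xp[i + j * dilation]
    return out

# An impulse input reveals the receptive field of the stacked layers.
x = np.zeros(31)
x[15] = 1.0
ones = np.ones(3)

y = x
for d in (1, 2, 4):              # doubling dilations, DeepLab-style
    y = dilated_conv1d(y, ones, d)

receptive_field = np.count_nonzero(y)
print(receptive_field)           # 15 input taps reached by three 3-tap layers
print(len(y) == len(x))          # True — resolution preserved, no downsampling
```

Three ordinary (dilation-1) 3-tap layers would only reach 7 positions; doubling the dilation each layer grows the receptive field exponentially while the output stays the same length as the input.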
Instance Segmentation Models
Mask R-CNN (2017): Extends Faster R-CNN (object detection) with an additional branch that predicts a pixel-level mask for each detected object. Three outputs per detection: class, bounding box, and mask. Still widely used in production.
How Mask R-CNN works:
- Extract features from the image (CNN backbone)
- Propose regions (Region Proposal Network)
- For each region: predict class, refine bounding box, AND generate a binary pixel mask
- The mask branch adds minimal overhead to the existing detection pipeline
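The three per-detection outputs can be sketched with toy numbers. Everything below is fabricated for illustration — in Mask R-CNN these values come from small head networks on RoIAligned features — but the box-delta parameterization (relative shifts, log-scale sizes) and the sigmoid-at-0.5 mask threshold match the standard R-CNN recipe:

```python
import numpy as np

def apply_box_deltas(box, deltas):
    """Refine a proposal box with predicted (dx, dy, dw, dh) deltas using the
    standard R-CNN parameterization: center shifts relative to box size,
    width/height changes in log scale."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h
    w, h = w * np.exp(dw), h * np.exp(dh)
    return np.array([cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h])

# Fabricated head outputs for one proposed region.
class_logits = np.array([0.2, 2.5, -1.0])            # e.g. [background, car, person]
box_deltas = np.array([0.0, 0.0, np.log(2.0), 0.0])  # double the width
mask_logits = np.full((28, 28), -1.0)                # 28x28 grid, as in the paper
mask_logits[8:20, 8:20] = 3.0                        # high logits where the object is

pred_class = int(np.argmax(class_logits))                # 1 -> "car"
refined_box = apply_box_deltas([10, 10, 30, 30], box_deltas)
binary_mask = 1 / (1 + np.exp(-mask_logits)) > 0.5       # sigmoid, threshold 0.5

print(pred_class)         # 1
print(refined_box)        # [ 0. 10. 40. 30.] — width doubled around the center
print(binary_mask.sum())  # 144 pixels in the 12x12 mask region
```

The 28×28 mask is predicted in the region's own coordinate frame and then resized to the refined box, which is why the mask branch adds so little cost on top of detection.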
SAM: The Foundation Model
Segment Anything Model (SAM, Meta 2023): A foundation model for segmentation that can segment any object in any image without task-specific training. Trained on 11 million images with 1.1 billion masks.
SAM accepts several types of prompts:
- Point: Click on an object → SAM segments it
- Box: Draw a bounding box → SAM segments the object inside
- Mask: Provide a rough mask → SAM refines it
Text prompts ("segment the dog") were explored in the SAM paper but are not part of the publicly released model.
SAM represents a paradigm shift — from training specialized segmentation models for each task to using a general-purpose model that handles novel objects zero-shot.
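To make the point-prompt interface concrete without downloading SAM, here is a deliberately naive stand-in: region growing from a "clicked" pixel. Only the interface (seed point in, binary mask out) mimics SAM — the real model uses a ViT image encoder and a learned mask decoder, not intensity thresholds, and the `tol` parameter is an assumption of this toy:

```python
import numpy as np
from collections import deque

def segment_from_point(image, seed, tol=10):
    """Toy point-prompted segmentation: grow a mask outward from the seed,
    adding 4-connected neighbours whose intensity is within `tol` of the
    seed pixel. Interface-only mimic of SAM's point prompt."""
    h, w = image.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    seed_val = int(image[seed])
    q = deque([seed])
    while q:
        r, c = q.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w and not mask[nr, nc]
                    and abs(int(image[nr, nc]) - seed_val) <= tol):
                mask[nr, nc] = True
                q.append((nr, nc))
    return mask

# A bright square on a dark background; "click" inside the square.
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 200

mask = segment_from_point(img, seed=(3, 3))
print(mask.sum())   # 16 — exactly the 4x4 bright square
```

The contrast with the real thing is the lesson: this toy breaks on any textured object, while SAM's learned decoder generalizes the same one-click interface to arbitrary images.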
✅ Quick Check: When would you choose Mask R-CNN over SAM? When you need consistent, repeatable segmentation of specific known categories in a production pipeline — like segmenting tumors in medical scans or defects on an assembly line. SAM is flexible but less precise for specialized tasks. Mask R-CNN, fine-tuned on your specific data, produces more reliable results for known categories. SAM excels at interactive segmentation, annotation assistance, and tasks with novel object types.
Evaluation Metrics
IoU (Intersection over Union): The same metric as in detection, applied per pixel — the number of pixels where the predicted and ground-truth masks overlap, divided by the number of pixels in their union.
mIoU (mean IoU): Average IoU across all classes. The standard metric for semantic segmentation benchmarks.
AP (Average Precision): For instance segmentation — measures both detection accuracy and mask quality. Higher IoU thresholds demand more precise masks.
| Metric | What It Measures | Typical Good Score |
|---|---|---|
| mIoU | Semantic segmentation quality | 70-85% (dataset-dependent) |
| AP50 | Instance detection at IoU ≥ 0.5 | 50-65% (COCO) |
| AP75 | Instance detection at IoU ≥ 0.75 | 35-50% (COCO) |
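Per-class IoU and mIoU are short enough to compute by hand. A minimal sketch on a toy 4×4 prediction (the label maps and class IDs are made up for illustration):

```python
import numpy as np

def per_class_iou(pred, gt, num_classes):
    """Pixel-wise IoU for each class: |pred ∩ gt| / |pred ∪ gt|.
    Classes absent from both maps get NaN and are skipped in the mean."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else float("nan"))
    return np.array(ious)

# Toy maps with classes {0: background, 1: object}; the prediction
# spills one extra column over the object's right edge.
gt = np.array([[0, 0, 0, 0],
               [0, 1, 1, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0]])
pred = np.array([[0, 0, 0, 0],
                 [0, 1, 1, 1],
                 [0, 1, 1, 1],
                 [0, 0, 0, 0]])

ious = per_class_iou(pred, gt, num_classes=2)
miou = float(np.nanmean(ious))
print(ious)   # [0.8333... 0.6666...] — background, object
print(miou)   # 0.75
```

Averaging per class (rather than over all pixels) is what makes mIoU sensitive to small classes: a model that nails the road but misses every pedestrian scores poorly even though most pixels are correct.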
Key Takeaways
- Semantic segmentation labels every pixel by class but can’t separate instances — road, sky, building
- Instance segmentation separates individual objects but skips background — car-1, car-2, person-1
- Panoptic segmentation combines both — complete scene understanding
- U-Net (encoder-decoder + skip connections) dominates medical imaging segmentation
- Mask R-CNN extends object detection with per-pixel masks — the production standard for instances
- SAM segments anything zero-shot — foundation model trained on 1.1 billion masks
Up Next
Training segmentation models from scratch requires massive labeled datasets — each pixel manually labeled. Lesson 6 covers how transfer learning and data augmentation let you build accurate CV models with a fraction of the data.