How Machines See
How digital images are represented as pixel arrays, color channels, and feature maps — plus the preprocessing steps that prepare images for AI models.
From Pixels to Understanding
When you look at a photograph, you instantly see objects, people, depth, and context. A computer sees a grid of numbers. Bridging that gap is the core challenge of computer vision — and it starts with understanding how images are represented digitally.
Images as Numbers
A digital image is a 2D grid of pixels. Each pixel stores color information as numbers.
Grayscale images use one number per pixel — a brightness value from 0 (black) to 255 (white). A 28×28 grayscale image (the size used in the classic MNIST digit dataset) is just 784 numbers arranged in a grid.
Color images use three numbers per pixel — one each for Red, Green, and Blue (RGB). Each channel ranges from 0 to 255:
- Pure red: (255, 0, 0)
- Pure white: (255, 255, 255)
- Pure black: (0, 0, 0)
- Sky blue: (135, 206, 235)
A 224×224 color image — the standard input size for many CV models — contains 224 × 224 × 3 = 150,528 individual values. That’s what the neural network actually receives: not a “picture,” but a 3D array of numbers.
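The "3D array of numbers" view can be made concrete with a short NumPy sketch. The image here is hypothetical (random pixel values), but the shape and value count match the 224×224 RGB example above:

```python
import numpy as np

# A hypothetical 224x224 RGB image as a 3D array (height, width, channels),
# filled with random pixel values in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(image.shape)   # (224, 224, 3)
print(image.size)    # 150528 individual values
print(image[0, 0])   # the RGB triplet stored at the top-left pixel
```

Indexing `image[row, col]` returns the three channel values for one pixel; `image[:, :, 0]` would give the entire red channel as a 2D grid.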
Image Resolution and Information
Resolution determines how much detail an image contains:
| Resolution | Pixels | Color Values (RGB) | Use Case |
|---|---|---|---|
| 28×28 | 784 | 2,352 | MNIST digits, simple classification |
| 224×224 | 50,176 | 150,528 | Standard model input (ResNet, ViT) |
| 640×640 | 409,600 | 1,228,800 | YOLO object detection |
| 1920×1080 | 2,073,600 | 6,220,800 | HD camera feed |
| 4000×4000 | 16,000,000 | 48,000,000 | Satellite/medical imaging |
Higher resolution means more detail but quadratically more computation — doubling both dimensions quadruples the pixel count. Most CV models resize images to a fixed input size (224×224 or 640×640) as the first step.
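The numbers in the table are simple arithmetic, which you can verify directly — a quick sketch:

```python
# Pixel and RGB-value counts for the resolutions in the table above.
resolutions = [(28, 28), (224, 224), (640, 640), (1920, 1080), (4000, 4000)]
for w, h in resolutions:
    pixels = w * h
    rgb_values = pixels * 3
    print(f"{w}x{h}: {pixels:,} pixels, {rgb_values:,} RGB values")

# Relative cost of a 4000x4000 image vs. a standard 224x224 input:
print(round((4000 * 4000) / (224 * 224)))  # 319
```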
✅ Quick Check: Why do most CV models use a fixed input size like 224×224 instead of accepting any resolution? Neural networks have fixed-size weight matrices — the layers expect a specific input dimension. Also, processing a 4000×4000 image requires roughly 320× more computation than a 224×224 image. Fixed sizes make batch processing efficient (all images in a batch must be the same shape) and keep memory usage predictable. Some architectures (like ViT) can handle variable sizes, but fixed input remains the standard for efficiency.
Image Preprocessing
Raw images need preprocessing before feeding them to a model. The standard pipeline:
1. Resizing: Scale to the model’s expected input size (usually 224×224 or 640×640).
2. Normalization: Scale pixel values from 0-255 to a smaller range, typically 0-1 or centered around 0. This helps the model train faster and more stably.
- Divide by 255: maps [0, 255] → [0, 1]
- Mean/std normalization: subtract dataset mean, divide by std → centers around 0
3. Channel ordering: Some frameworks expect (height, width, channels) — called “channels last.” Others expect (channels, height, width) — “channels first.” PyTorch uses channels first; TensorFlow uses channels last by default.
4. Tensor conversion: Convert from a PIL image or NumPy array to the framework’s tensor format for GPU processing.
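Steps 2–3 of the pipeline can be sketched in NumPy. The mean/std constants below are the widely used ImageNet statistics; the `preprocess` function is a minimal illustration, not any particular library's API, and it assumes resizing (step 1) has already been done, e.g. with PIL's `Image.resize`:

```python
import numpy as np

# ImageNet channel statistics, commonly used for mean/std normalization.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: np.ndarray) -> np.ndarray:
    """Normalize a (H, W, 3) uint8 image and reorder to channels-first."""
    x = image.astype(np.float32) / 255.0      # step 2a: [0, 255] -> [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD    # step 2b: center around 0
    x = np.transpose(x, (2, 0, 1))            # step 3: HWC -> CHW (channels first)
    return x                                  # step 4 would hand this to the framework

# Example: a uniform mid-gray 224x224 RGB image.
img = np.full((224, 224, 3), 128, dtype=np.uint8)
tensor = preprocess(img)
print(tensor.shape)  # (3, 224, 224)
```

In PyTorch, step 4 would be `torch.from_numpy(tensor)`; in TensorFlow you would skip the transpose and keep channels last.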
Feature Hierarchies: What Models Actually Learn
Early computer vision (before deep learning) relied on hand-designed features — edge detectors, corner detectors, texture descriptors that engineers manually coded. These worked for simple tasks but failed on complex visual scenes.
Deep learning changed this. CNNs (which we’ll cover in Lesson 3) learn features automatically from data, building a hierarchy:
| Layer Depth | What It Detects | Example |
|---|---|---|
| Layer 1 | Edges, gradients | Horizontal lines, vertical lines, diagonal edges |
| Layer 2 | Textures, corners | Brick patterns, fabric textures, circular shapes |
| Layer 3 | Parts, shapes | Wheels, eyes, windows, handles |
| Layer 4+ | Objects, scenes | Cars, faces, buildings, animals |
Each layer builds on the previous one — edges combine into textures, textures into parts, parts into objects. This hierarchical feature learning is why CNNs outperform hand-designed features so dramatically.
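To make "Layer 1 detects edges" concrete, here is a hand-designed edge detector of the kind CNN first layers rediscover on their own: a Sobel kernel slid over a toy image. The `convolve2d` helper is a deliberately explicit sketch (CNN frameworks compute the same sliding-window operation, technically cross-correlation, far more efficiently):

```python
import numpy as np

# Sobel kernel for horizontal edges -- a classic hand-designed feature.
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=np.float32)

def convolve2d(image, kernel):
    """Valid-mode sliding-window filtering (no padding), written out explicitly."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy grayscale image: dark on top, bright on the bottom.
img = np.zeros((6, 6), dtype=np.float32)
img[3:, :] = 255.0

response = convolve2d(img, sobel_y)
print(response)  # large positive values only along the dark-to-bright boundary
```

The output is near zero in flat regions and spikes where brightness changes — exactly the kind of low-level feature map that deeper layers combine into textures, parts, and objects.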
✅ Quick Check: Why does automatic feature learning (what CNNs do) outperform hand-designed features for complex visual tasks? Hand-designed features require human engineers to anticipate what visual patterns matter. For simple tasks (detecting edges), this works fine. For complex tasks (distinguishing 1,000 dog breeds), no human can design the right features — the subtle differences in ear shape, fur texture, and muzzle proportions are too complex to specify manually. CNNs discover these features automatically from training data, often finding patterns that humans wouldn’t think to look for.
Key Takeaways
- Digital images are 3D arrays of numbers: height × width × channels (RGB = 3 channels)
- A 224×224 color image = 150,528 individual values that the neural network processes
- Preprocessing: resize → normalize → convert to tensor — the standard pipeline for any CV model
- Higher resolution preserves detail but computation grows quadratically with image side length
- Deep learning replaces hand-designed features with automatic hierarchical feature learning
- Feature hierarchy: edges → textures → parts → objects — each layer builds on the previous
Up Next
Now that you understand how images become numbers, Lesson 3 introduces the architecture designed specifically to process them: convolutional neural networks. CNNs exploit the spatial structure of images using clever weight sharing that makes them dramatically more efficient than feedforward networks.