How Machines See
How digital images are represented as pixel arrays, color channels, and feature maps — plus the preprocessing steps that prepare images for AI models.
From Pixels to Understanding
When you look at a photograph, you instantly see objects, people, depth, and context. A computer sees a grid of numbers. Bridging that gap is the core challenge of computer vision — and it starts with understanding how images are represented digitally.
Images as Numbers
A digital image is a 2D grid of pixels. Each pixel stores color information as numbers.
Grayscale images use one number per pixel — a brightness value from 0 (black) to 255 (white). A 28×28 grayscale image (the size used in the classic MNIST digit dataset) is just 784 numbers arranged in a grid.
Color images use three numbers per pixel — one each for Red, Green, and Blue (RGB). Each channel ranges from 0 to 255:
- Pure red: (255, 0, 0)
- Pure white: (255, 255, 255)
- Pure black: (0, 0, 0)
- Sky blue: (135, 206, 235)
A 224×224 color image — the standard input size for many CV models — contains 224 × 224 × 3 = 150,528 individual values. That’s what the neural network actually receives: not a “picture,” but a 3D array of numbers.
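The "3D array of numbers" view can be made concrete with a short NumPy sketch. The image here is hypothetical (random pixel values), but the shape and value count match the 224×224 RGB example above:

```python
import numpy as np

# A hypothetical 224x224 RGB image as a 3D array (height, width, channels),
# filled with random pixel values in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

print(image.shape)   # (224, 224, 3)
print(image.size)    # 150528 individual values
print(image[0, 0])   # the RGB triplet stored at the top-left pixel
```

Indexing `image[row, col]` returns the three channel values for one pixel; `image[:, :, 0]` would give the entire red channel as a 2D grid.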
Image Resolution and Information
Resolution determines how much detail an image contains:
| Resolution | Pixels | Color Values (RGB) | Use Case |
|---|---|---|---|
| 28×28 | 784 | 2,352 | MNIST digits, simple classification |
| 224×224 | 50,176 | 150,528 | Standard model input (ResNet, ViT) |
| 640×640 | 409,600 | 1,228,800 | YOLO object detection |
| 1920×1080 | 2,073,600 | 6,220,800 | HD camera feed |
| 4000×4000 | 16,000,000 | 48,000,000 | Satellite/medical imaging |
Higher resolution means more detail but quadratically more computation — doubling both dimensions quadruples the pixel count. Most CV models resize images to a fixed input size (224×224 or 640×640) as the first step.
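The numbers in the table are simple arithmetic, which you can verify directly — a quick sketch:

```python
# Pixel and RGB-value counts for the resolutions in the table above.
resolutions = [(28, 28), (224, 224), (640, 640), (1920, 1080), (4000, 4000)]
for w, h in resolutions:
    pixels = w * h
    rgb_values = pixels * 3
    print(f"{w}x{h}: {pixels:,} pixels, {rgb_values:,} RGB values")

# Relative cost of a 4000x4000 image vs. a standard 224x224 input:
print(round((4000 * 4000) / (224 * 224)))  # 319
```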
✅ Quick Check: Why do most CV models use a fixed input size like 224×224 instead of accepting any resolution? Neural networks have fixed-size weight matrices — the layers expect a specific input dimension. Also, processing a 4000×4000 image requires roughly 320× more computation than a 224×224 image. Fixed sizes make batch processing efficient (all images in a batch must be the same shape) and keep memory usage predictable. Some architectures (like ViT) can handle variable sizes, but fixed input remains the standard for efficiency.
Image Preprocessing
Raw images need preprocessing before feeding them to a model. The standard pipeline:
1. Resizing: Scale to the model’s expected input size (usually 224×224 or 640×640).
2. Normalization: Scale pixel values from 0-255 to a smaller range, typically 0-1 or centered around 0. This helps the model train faster and more stably.
- Divide by 255: maps [0, 255] → [0, 1]
- Mean/std normalization: subtract dataset mean, divide by std → centers around 0
3. Channel ordering: Some frameworks expect (height, width, channels) — called “channels last.” Others expect (channels, height, width) — “channels first.” PyTorch uses channels first; TensorFlow uses channels last by default.
4. Tensor conversion: Convert from a PIL image or NumPy array to the framework’s tensor format for GPU processing.
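Steps 2–3 of the pipeline can be sketched in NumPy. The mean/std constants below are the widely used ImageNet statistics; the `preprocess` function is a minimal illustration, not any particular library's API, and it assumes resizing (step 1) has already been done, e.g. with PIL's `Image.resize`:

```python
import numpy as np

# ImageNet channel statistics, commonly used for mean/std normalization.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: np.ndarray) -> np.ndarray:
    """Normalize a (H, W, 3) uint8 image and reorder to channels-first."""
    x = image.astype(np.float32) / 255.0      # step 2a: [0, 255] -> [0, 1]
    x = (x - IMAGENET_MEAN) / IMAGENET_STD    # step 2b: center around 0
    x = np.transpose(x, (2, 0, 1))            # step 3: HWC -> CHW (channels first)
    return x                                  # step 4 would hand this to the framework

# Example: a uniform mid-gray 224x224 RGB image.
img = np.full((224, 224, 3), 128, dtype=np.uint8)
tensor = preprocess(img)
print(tensor.shape)  # (3, 224, 224)
```

In PyTorch, step 4 would be `torch.from_numpy(tensor)`; in TensorFlow you would skip the transpose and keep channels last.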
Feature Hierarchies: What Models Actually Learn
Early computer vision (before deep learning) relied on hand-designed features — edge detectors, corner detectors, texture descriptors that engineers manually coded. These worked for simple tasks but failed on complex visual scenes.
Deep learning changed this. CNNs (which we’ll cover in Lesson 3) learn features automatically from data, building a hierarchy:
| Layer Depth | What It Detects | Example |
|---|---|---|
| Layer 1 | Edges, gradients | Horizontal lines, vertical lines, diagonal edges |
| Layer 2 | Textures, corners | Brick patterns, fabric textures, circular shapes |
| Layer 3 | Parts, shapes | Wheels, eyes, windows, handles |
| Layer 4+ | Objects, scenes | Cars, faces, buildings, animals |
Each layer builds on the previous one — edges combine into textures, textures into parts, parts into objects. This hierarchical feature learning is why CNNs outperform hand-designed features so dramatically.
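To make "Layer 1 detects edges" concrete, here is a hand-designed edge detector of the kind CNN first layers rediscover on their own: a Sobel kernel slid over a toy image. The `convolve2d` helper is a deliberately explicit sketch (CNN frameworks compute the same sliding-window operation, technically cross-correlation, far more efficiently):

```python
import numpy as np

# Sobel kernel for horizontal edges -- a classic hand-designed feature.
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=np.float32)

def convolve2d(image, kernel):
    """Valid-mode sliding-window filtering (no padding), written out explicitly."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy grayscale image: dark on top, bright on the bottom.
img = np.zeros((6, 6), dtype=np.float32)
img[3:, :] = 255.0

response = convolve2d(img, sobel_y)
print(response)  # large positive values only along the dark-to-bright boundary
```

The output is near zero in flat regions and spikes where brightness changes — exactly the kind of low-level feature map that deeper layers combine into textures, parts, and objects.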
✅ Quick Check: Why does automatic feature learning (what CNNs do) outperform hand-designed features for complex visual tasks? Hand-designed features require human engineers to anticipate what visual patterns matter. For simple tasks (detecting edges), this works fine. For complex tasks (distinguishing 1,000 dog breeds), no human can design the right features — the subtle differences in ear shape, fur texture, and muzzle proportions are too complex to specify manually. CNNs discover these features automatically from training data, often finding patterns that humans wouldn’t think to look for.
Key Takeaways
- Digital images are 3D arrays of numbers: height × width × channels (RGB = 3 channels)
- A 224×224 color image = 150,528 individual values that the neural network processes
- Preprocessing: resize → normalize → convert to tensor — the standard pipeline for any CV model
- Higher resolution preserves detail but computation grows quadratically with image side length
- Deep learning replaces hand-designed features with automatic hierarchical feature learning
- Feature hierarchy: edges → textures → parts → objects — each layer builds on the previous
Up Next
Now that you understand how images become numbers, Lesson 3 introduces the architecture designed specifically to process them: convolutional neural networks. CNNs exploit the spatial structure of images using clever weight sharing that makes them dramatically more efficient than feedforward networks.