Tools & Frameworks
The ML software stack — scikit-learn, TensorFlow, PyTorch, Keras, pandas, and NumPy. What each does and when to use it.
The Software Stack
🔄 Lessons 3-5 covered algorithms, data, and evaluation. Now let’s look at the tools that turn these concepts into working systems. The ML software stack has clear layers, each with a specific job.
The Foundation: Python
Python dominates ML. Not because it’s the fastest language (it isn’t), but because its ecosystem of libraries makes ML practical:
| Library | Role | Analogy |
|---|---|---|
| NumPy | Numerical computation | The calculator |
| pandas | Data manipulation | The spreadsheet |
| matplotlib | Visualization | The chart maker |
| scikit-learn | Traditional ML | The algorithms |
| PyTorch | Deep learning (research) | The neural network lab |
| TensorFlow | Deep learning (production) | The neural network factory |
| Keras | Deep learning (simplified) | The easy button |
Every ML project uses NumPy and pandas. The ML framework you add on top depends on your problem type.
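To make the foundation layer concrete, here is a minimal NumPy sketch (the arrays and values are hypothetical). The point is vectorization: NumPy operates on whole arrays at once instead of looping in Python.

```python
import numpy as np

# Hypothetical order data: NumPy computes over whole arrays at once,
# avoiding slow Python-level loops.
prices = np.array([9.99, 14.50, 3.25, 20.00])
quantities = np.array([3, 1, 10, 2])

revenue = prices * quantities  # element-wise multiply, no loop needed
total = revenue.sum()
print(total)
```

This array-at-a-time style is what pandas, scikit-learn, PyTorch, and TensorFlow all build on.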
pandas: Data Preparation
Every ML project starts with pandas. It’s the tool that loads, explores, cleans, and transforms data before any algorithm touches it.
What pandas does:
- Load data from CSV, Excel, SQL databases, JSON
- Explore data: column types, missing values, summary statistics
- Clean data: handle missing values, remove duplicates, fix formats
- Engineer features: create new columns, encode categories, aggregate
- Export clean data for ML training
The workflow:
Load CSV → Explore (describe, info) → Clean (fillna, dropna) →
Engineer features → Export for ML
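The workflow above can be sketched in a few lines of pandas; the column names and values here are hypothetical stand-ins for real data:

```python
import io
import pandas as pd

# Hypothetical raw data, as it might arrive in a CSV export
raw_csv = io.StringIO(
    "age,income,city\n"
    "34,52000,Boston\n"
    "29,,Denver\n"       # missing income value
    "34,52000,Boston\n"  # duplicate row
)

# Load
df = pd.read_csv(raw_csv)

# Explore: column types, missing values, summary statistics
df.info()
print(df.describe())

# Clean: remove duplicates, fill missing income with the median
df = df.drop_duplicates()
df["income"] = df["income"].fillna(df["income"].median())

# Engineer a feature, then export clean data for ML training
df["high_income"] = (df["income"] > 50000).astype(int)
df.to_csv("clean_data.csv", index=False)
```

Each step maps directly onto the Load → Explore → Clean → Engineer → Export flow.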
pandas handles the part of ML work that isn’t algorithms. Data scientists joke that they spend 80% of their time on data preparation and only 20% on actual modeling — pandas is the tool for that 80%.
scikit-learn: Traditional Machine Learning
scikit-learn is the gold standard for traditional ML algorithms. If your data lives in a spreadsheet and you’re not using neural networks, scikit-learn is almost certainly the right choice.
What it covers:
- Classification: random forests, SVM, logistic regression, k-nearest neighbors
- Regression: linear regression, decision trees, gradient boosting
- Clustering: K-means, DBSCAN, hierarchical clustering
- Preprocessing: scaling, encoding, imputation
- Evaluation: accuracy, precision, recall, cross-validation
- Model selection: grid search, hyperparameter tuning
Why ML practitioners love it: Consistent API. Every algorithm follows the same pattern: create the model, .fit() on training data, .predict() on new data. Learn the pattern once, apply it to any algorithm.
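That create → .fit() → .predict() pattern, sketched on synthetic data (a random forest here, but any scikit-learn estimator follows the same three steps):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data standing in for a real spreadsheet
X, y = make_classification(n_samples=200, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The universal scikit-learn pattern:
model = RandomForestClassifier(random_state=42)  # 1. create the model
model.fit(X_train, y_train)                      # 2. fit on training data
predictions = model.predict(X_test)              # 3. predict on new data

print(model.score(X_test, y_test))  # accuracy on held-out data
```

Swap `RandomForestClassifier` for `LogisticRegression` or `KNeighborsClassifier` and the rest of the code is unchanged — that consistency is the whole point.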
When to use scikit-learn: Structured/tabular data (spreadsheets, databases, CSV files), traditional algorithms (not deep learning), projects where interpretability matters, quick prototyping.
✅ Quick Check: You need to build a spam classifier. Your data is a CSV with 15 columns (word counts, sender info, email metadata). Would you use scikit-learn or PyTorch?

Answer: scikit-learn. This is structured tabular data with 15 features; a random forest or logistic regression in scikit-learn handles it in a few lines of code. PyTorch is built for neural networks on unstructured data — using it here adds complexity without adding value.
PyTorch: Deep Learning for Research
PyTorch is the dominant framework for deep learning research and education. 60%+ of beginners choose it first, and most new ML research papers use it.
Key strength: dynamic computation graphs. PyTorch code runs like regular Python — you can print values, set breakpoints, and step through execution line by line. This makes debugging intuitive, which is crucial when you’re learning or experimenting.
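A tiny sketch of that eager, line-by-line execution (the tensor values are arbitrary):

```python
import torch

# PyTorch executes eagerly: each line runs immediately, so you can
# print intermediate values or drop in a debugger at any point.
x = torch.tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = (x ** 2).sum()
print(y)        # inspect the intermediate result like any Python value

y.backward()    # autograd computes dy/dx = 2x
print(x.grad)   # gradients: [[2., 4.], [6., 8.]]
```

There is no separate "compile the graph, then run it" step — the graph is built as the Python code runs, which is what makes debugging feel natural.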
When to use PyTorch:
- Neural networks (CNNs for images, transformers for text, RNNs for sequences)
- Research and experimentation
- Learning deep learning concepts
- Projects where you need flexibility to customize architecture
TensorFlow: Deep Learning for Production
TensorFlow powers ML systems at Google and many large enterprises. It’s optimized for production deployment — serving models to millions of users, running on mobile devices, and scaling across data centers.
Key strength: production ecosystem. TF Serving deploys models as APIs. TF Lite runs models on mobile and edge devices. TF.js runs in browsers. The ecosystem is built for taking a trained model and putting it in front of users.
When to use TensorFlow:
- Production deployment at scale
- Mobile or edge device ML
- Enterprise ML infrastructure
- When your organization already uses TensorFlow
Keras: The Easy Button
Keras is a high-level interface for building neural networks. It abstracts away the complex details, letting you build and train models in a few lines.
What makes it beginner-friendly: Instead of defining every matrix multiplication and gradient calculation, you describe the network structure in plain terms: “A layer with 128 neurons, then a layer with 64 neurons, then an output layer with 10 classes.”
Keras runs on top of TensorFlow (it’s built into TensorFlow as tf.keras). Think of it as a simplified control panel for TensorFlow’s engine.
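The network described above — 128 neurons, then 64, then 10 output classes — takes only a few lines in tf.keras (the input size of 20 features is a hypothetical placeholder):

```python
import tensorflow as tf

# Each line declares one layer; Keras handles the math underneath.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),                        # hypothetical: 20 input features
    tf.keras.layers.Dense(128, activation="relu"),      # layer with 128 neurons
    tf.keras.layers.Dense(64, activation="relu"),       # layer with 64 neurons
    tf.keras.layers.Dense(10, activation="softmax"),    # output layer, 10 classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

No manual matrix multiplications, no gradient code — you describe the structure, and TensorFlow’s engine does the rest.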
When to use Keras: Your first deep learning project, quick prototyping, when you want results fast without deep framework knowledge.
Choosing Your Framework
| Question | Answer |
|---|---|
| Structured data + traditional algorithms? | scikit-learn |
| Images/text/audio + neural networks? | PyTorch (learning) or TensorFlow (production) |
| First deep learning project? | Keras (simplest) or PyTorch (most intuitive) |
| Deploying to mobile/edge? | TensorFlow + TF Lite |
| Research and experimentation? | PyTorch |
| Enterprise deployment? | TensorFlow |
✅ Quick Check: A startup wants to build a recommendation system. They’ll prototype quickly, iterate fast, and eventually deploy to production. What’s their framework path?

Answer: Start with scikit-learn for a baseline (collaborative filtering is traditional ML). If performance demands deep learning, prototype in PyTorch (faster iteration), then deploy the final model with TensorFlow Serving (production-ready). This prototype-to-deploy pattern — PyTorch for research, TensorFlow for production — is common in industry.
Key Takeaways
- Python is the language of ML — NumPy, pandas, matplotlib form the foundation
- pandas handles data preparation (80% of ML work) — loading, cleaning, engineering features
- scikit-learn is the gold standard for traditional ML on structured data — consistent API, broad algorithm coverage
- PyTorch dominates research and learning — dynamic graphs, intuitive debugging, 60% beginner adoption
- TensorFlow dominates production — model serving, mobile deployment, enterprise scale
- Keras simplifies deep learning — high-level interface on top of TensorFlow
- Choose by problem type: structured data → scikit-learn, neural networks → PyTorch/TensorFlow
Up Next
You now understand the core concepts: algorithms, data, evaluation, and tools. Lesson 7 puts it all in context — real-world ML applications across industries, plus the ethical challenges of algorithmic bias, fairness, and accountability.