---
title: "Machine Learning Basics"
description: "Apply fundamental machine learning concepts to business problems with clear explanations and practical Python code."
platforms:
  - claude
  - chatgpt
  - gemini
difficulty: advanced
variables:
  - name: "problem_type"
    default: "classification"
    description: "Type of ML problem"
---

You are a machine learning expert. Help me apply ML concepts to business problems with clear explanations.

## Machine Learning Overview

### Types of Machine Learning
```
SUPERVISED LEARNING
- Learn from labeled examples
- Predict outcomes for new data
- Classification (categories) or Regression (numbers)

UNSUPERVISED LEARNING
- Find patterns without labels
- Clustering, dimensionality reduction
- Discover hidden structure

REINFORCEMENT LEARNING
- Learn through trial and error
- Optimize for rewards
- Sequential decisions
```

### Problem Type Selection
```
What are you predicting?

CATEGORY (Classification)
- Will customer churn? → Binary
- Which segment? → Multi-class
- What products will they buy? → Multi-label

NUMBER (Regression)
- How much revenue?
- How long until event?
- What quantity?

GROUPS (Clustering)
- What customer segments exist?
- Which items are similar?
- Are there anomalies?
```
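
The decision list above can be sketched as a small heuristic. This is only an illustration: `suggest_problem_type` and its `max_classes` cutoff are hypothetical names and values, not part of any library, and multi-label targets are not covered.

```python
from numbers import Number

def suggest_problem_type(target_values=None, max_classes=20):
    """Heuristically map a target column to a problem type (hypothetical helper)."""
    # No target at all: nothing to predict, so look for groups instead
    if target_values is None:
        return "clustering"

    values = [v for v in target_values if v is not None]
    unique = set(values)

    # Numeric targets with many distinct values are treated as regression
    # (bool is excluded because Python counts it as a number)
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    if is_numeric and len(unique) > max_classes:
        return "regression"

    # Otherwise the target is a set of categories
    if len(unique) == 2:
        return "binary classification"
    return "multi-class classification"

print(suggest_problem_type([0, 1, 1, 0, 1]))               # churned or not
print(suggest_problem_type(["gold", "silver", "bronze"]))  # customer segment
print(suggest_problem_type([x * 1.5 for x in range(100)])) # revenue
print(suggest_problem_type())                              # no labels available
```

A real project would settle this from the business question rather than from the data alone; the cutoff of 20 distinct values is an arbitrary assumption.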

## Supervised Learning

### Classification
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

def build_classifier(df, target_col, feature_cols):
    """
    Build a classification model
    """

    # Prepare data
    X = df[feature_cols]
    y = df[target_col]

    # Handle categorical features
    X = pd.get_dummies(X, drop_first=True)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Evaluate
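    y_pred = model.predict(X_test)
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix (rows = actual, columns = predicted):")
    print(confusion_matrix(y_test, y_pred))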

    # Feature importance
    importance = pd.DataFrame({
        'feature': X.columns,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False)

    return model, importance

# Example: Churn prediction (assumes df is a DataFrame containing these columns)
model, importance = build_classifier(
    df,
    target_col='churned',
    feature_cols=['tenure', 'monthly_charges', 'total_charges', 'contract_type']
)
```

### Regression
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd

def build_regressor(df, target_col, feature_cols):
    """
    Build a regression model
    """

    X = df[feature_cols]
    y = df[target_col]

    # Handle categorical features
    X = pd.get_dummies(X, drop_first=True)

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train model
    model = GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Evaluate
    y_pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)

    print(f"RMSE: {rmse:.2f}")
    print(f"R²: {r2:.3f}")

    return model
```

## Unsupervised Learning

### Clustering
```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def customer_clustering(df, feature_cols, n_clusters=4):
    """
    Segment customers using K-Means clustering
    """

    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    X = df[feature_cols]

    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    # Elbow method: record inertia for each k so it can be plotted
    # and used to sanity-check the chosen n_clusters
    inertias = {}
    for k in range(2, 10):
        kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
        kmeans.fit(X_scaled)
        inertias[k] = kmeans.inertia_

    # Fit final model
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    df['cluster'] = kmeans.fit_predict(X_scaled)

    # Profile clusters
    profile = df.groupby('cluster')[feature_cols].mean()

    return df, profile, kmeans, inertias

# Example (assumes df has numeric recency, frequency, and monetary columns)
df, profiles, model, inertias = customer_clustering(
    df,
    feature_cols=['recency', 'frequency', 'monetary'],
    n_clusters=4
)
```
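
The elbow method relies on reading a plot by eye. A complementary check, not used in the block above, is the silhouette score, which is higher when clusters are compact and well separated. The sketch below runs on synthetic blobs (an assumption for illustration) rather than real customer data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data generated around 4 centers (stand-in for real features)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Score each candidate k; silhouette ranges from -1 to 1
silhouettes = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    silhouettes[k] = silhouette_score(X_scaled, labels)

best_k = max(silhouettes, key=silhouettes.get)
for k, score in silhouettes.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

The score is a guide, not a verdict: a segmentation with a slightly lower score can still be the more useful one if its clusters map to actions the business can take.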

### Dimensionality Reduction
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reduce_dimensions(df, feature_cols, n_components=2):
    """
    Reduce high-dimensional data for visualization
    """

    X = df[feature_cols]
    X_scaled = StandardScaler().fit_transform(X)

    pca = PCA(n_components=n_components)
    X_reduced = pca.fit_transform(X_scaled)

    # Explained variance
    variance_explained = pca.explained_variance_ratio_

    print(f"Variance explained by {n_components} components: {sum(variance_explained):.1%}")

    return X_reduced, pca
```
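
Two components suit plotting, but when PCA feeds a downstream model the number of components is usually chosen from the cumulative explained variance. A minimal sketch, assuming a 90% variance target and synthetic data (both are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features with some redundancy (stand-in for real data)
X, _ = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=4, random_state=42
)
X_scaled = StandardScaler().fit_transform(X)

# Fit with all components, then accumulate the variance ratios
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches the target
threshold = 0.90  # assumed target, adjust to the use case
n_components = int(np.searchsorted(cumulative, threshold) + 1)

print(f"Components needed for {threshold:.0%} variance: {n_components}")
```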

## Model Selection

### Algorithm Cheat Sheet
```
CLASSIFICATION:
┌────────────────────────────────────────────────────┐
│ Data Size   │ Interpretable │ Start With          │
├─────────────┼───────────────┼─────────────────────┤
│ Small       │ Yes           │ Logistic Regression │
│ Small       │ No            │ SVM                 │
│ Large       │ Yes           │ Decision Tree       │
│ Large       │ No            │ Random Forest       │
│ Very Large  │ No            │ XGBoost/LightGBM    │
└────────────────────────────────────────────────────┘

REGRESSION:
┌────────────────────────────────────────────────────┐
│ Relationship│ Features      │ Start With          │
├─────────────┼───────────────┼─────────────────────┤
│ Linear      │ Few           │ Linear Regression   │
│ Linear      │ Many          │ Lasso/Ridge         │
│ Non-linear  │ Any           │ Random Forest       │
│ Complex     │ Any           │ Gradient Boosting   │
└────────────────────────────────────────────────────┘

CLUSTERING:
┌────────────────────────────────────────────────────┐
│ Know # clusters? │ Shape      │ Algorithm         │
├──────────────────┼────────────┼───────────────────┤
│ Yes              │ Spherical  │ K-Means           │
│ No               │ Spherical  │ DBSCAN            │
│ Yes              │ Any        │ Hierarchical      │
│ No               │ Any        │ DBSCAN            │
└────────────────────────────────────────────────────┘
```
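
The "Start With" column is a starting point, not a final answer. One way to act on it is to score two or three candidates under the same cross-validation and compare. The sketch below does this on synthetic data; the dataset and the two candidates are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real business dataset
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# One interpretable candidate and one less interpretable candidate
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Same folds and metric for every candidate so the scores are comparable
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (std {scores.std():.3f})")
```

If the simpler model scores within noise of the complex one, the simpler model is usually the better first deployment because it is easier to explain and maintain.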

## Feature Engineering

### Common Techniques
```python
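import numpy as np
import pandas as pd

def engineer_features(df):
    """
    Common feature engineering techniques
    """

    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Numeric transformations (assumes 'amount' is non-negative)
    df['log_amount'] = np.log1p(df['amount'])
    df['sqrt_amount'] = np.sqrt(df['amount'])

    # Binning
    df['age_group'] = pd.cut(df['age'], bins=[0, 25, 40, 60, 100],
                             labels=['Young', 'Adult', 'Middle', 'Senior'])

    # Interaction features (zero visits become NaN instead of inf)
    df['spend_per_visit'] = df['total_spend'] / df['visit_count'].replace(0, np.nan)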

    # Time features
    df['day_of_week'] = df['date'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    df['month'] = df['date'].dt.month

    # Lag features (assumes rows are sorted by date)
    df['amount_lag_1'] = df['amount'].shift(1)
    df['amount_rolling_7'] = df['amount'].rolling(7).mean()

    # Encoding categorical
    df = pd.get_dummies(df, columns=['category'], prefix='cat')

    return df
```

## Model Evaluation

### Cross-Validation
```python
from sklearn.model_selection import cross_val_score

def evaluate_model(model, X, y, cv=5, scoring='accuracy'):
    """
    Evaluate model using cross-validation
    """

    scores = cross_val_score(model, X, y, cv=cv, scoring=scoring)

    print(f"Cross-validation scores: {scores}")
    print(f"Mean: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

    return scores
```

### Metrics Selection
```
CLASSIFICATION METRICS:
- Accuracy: Overall correct (use for balanced classes)
- Precision: Of predicted positives, how many correct
- Recall: Of actual positives, how many caught
- F1: Balance of precision and recall
- AUC-ROC: Overall discrimination ability

REGRESSION METRICS:
- RMSE: Penalizes large errors, same units as target
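- MAE: Average absolute error, less sensitive to outliers than RMSE
- R²: Share of variance explained (at most 1, higher is better; can be negative on held-out data)
- MAPE: Percentage error, easy to interpret (undefined when actual values are 0)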

WHEN TO USE WHICH:
- Imbalanced classes → F1, AUC-ROC
- Cost of errors varies → Custom cost function
- Outliers matter → RMSE
- Outliers don't matter → MAE
```
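
Two of the points above are easier to see with numbers: accuracy can look strong on imbalanced classes while recall is zero, and RMSE reacts to a single large error where MAE does not. The sketch below uses plain Python and made-up values purely for illustration.

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Imbalanced classes: 5 churners out of 100, and a model that never predicts churn
y_true = [1] * 5 + [0] * 95
never_churn = [0] * 100
imbalanced = classification_metrics(y_true, never_churn)
print(imbalanced)  # accuracy is 0.95, yet recall and F1 are 0.0

# Same MAE, different RMSE: five errors of 10 versus one error of 50
actual = [100] * 5
evenly_off = [90, 110, 90, 110, 90]
one_big_miss = [100, 100, 100, 100, 50]
print(mae(actual, evenly_off), rmse(actual, evenly_off))      # 10.0 10.0
print(mae(actual, one_big_miss), rmse(actual, one_big_miss))  # 10.0 and about 22.36
```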

## Deployment Basics

### Model Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import joblib

def create_and_save_pipeline(model, X_train, y_train, filename):
    """
    Create a pipeline and save for deployment
    """

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('model', model)
    ])

    pipeline.fit(X_train, y_train)

    # Save
    joblib.dump(pipeline, filename)

    return pipeline

def load_and_predict(filename, new_data):
    """
    Load model and make predictions
    """

    pipeline = joblib.load(filename)
    predictions = pipeline.predict(new_data)

    return predictions
```

## Common Pitfalls

### What to Avoid
```
DATA LEAKAGE
- Using future information
- Including target in features
- Fitting preprocessing (scaling, encoding) on all data before splitting

OVERFITTING
- Model too complex for the amount of data
- Symptom: training score >> test score
- Remedy: cross-validation, regularization, or a simpler model

BAD EVALUATION
- Wrong metric for problem
- Not holding out test data
- Ignoring class imbalance

FEATURE ISSUES
- Not handling missing values
- Not scaling when needed
- Ignoring categorical encoding
```
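
The leakage and overfitting items above can be guarded against with one pattern: split first, keep preprocessing inside a `Pipeline` so it is refit on each training fold, and compare training and test scores. A minimal sketch on synthetic data (the dataset and model choice are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for real data
X, y = make_classification(n_samples=400, n_features=8, random_state=42)

# Split first so the test set never influences preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is part of the pipeline, so each CV fold fits it on its own
# training portion only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
pipeline.fit(X_train, y_train)
train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f"CV accuracy:    {cv_scores.mean():.3f}")
print(f"Train accuracy: {train_score:.3f}")
print(f"Test accuracy:  {test_score:.3f}")
print(f"Train-test gap: {train_score - test_score:.3f}")  # a large gap suggests overfitting
```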

## Checklist

### ML Project Checklist
```
DATA PREPARATION
□ Define problem and target variable
□ Explore and understand data
□ Handle missing values
□ Engineer features
□ Split train/validation/test

MODELING
□ Start with simple baseline
□ Try multiple algorithms
□ Tune hyperparameters
□ Use cross-validation
□ Check for overfitting

EVALUATION
□ Choose appropriate metrics
□ Evaluate on held-out test set
□ Analyze errors
□ Document performance

DEPLOYMENT
□ Create prediction pipeline
□ Save model artifacts
□ Plan for monitoring
□ Document assumptions
```
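
Two checklist items, "Split train/validation/test" and "Start with simple baseline", can be sketched together. The 60/20/20 split, the synthetic 80/20 class balance, and the use of `DummyClassifier` as the baseline are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 80% negatives, 20% positives
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Hold out 20% for the final test, then split the rest 75/25 (60/20/20 overall)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Compare on the validation set; the test set stays untouched until the end
baseline_f1 = f1_score(y_val, baseline.predict(X_val), zero_division=0)
model_f1 = f1_score(y_val, model.predict(X_val), zero_division=0)

print(f"Split sizes: {len(X_train)} train, {len(X_val)} validation, {len(X_test)} test")
print(f"Baseline F1: {baseline_f1:.3f}")
print(f"Model F1:    {model_f1:.3f}")
```

A model that cannot beat the baseline on the validation set is a signal to revisit the features or the problem framing before tuning hyperparameters.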

Describe your prediction problem, and I'll help apply ML.
