Capstone: Production Deployment & Use Cases
Deploy your fine-tuned model to production: adapter merging, serving options, cost analysis, monitoring, and real-world SLM use cases.
You’ve built a fine-tuned model. It passes evaluation. Now the question every tutorial avoids: how do you actually deploy it? And more importantly: when does the math justify running your own model instead of calling an API?
This lesson covers the production side: deployment options, cost analysis, monitoring, and real-world use cases where fine-tuned SLMs replace much larger models.
Deployment Options
You have three paths from trained model to production:
Option 1: Merged Model on a GPU Server
Merge LoRA adapters into the base model, then serve with vLLM or TGI:
```python
# Merge the LoRA weights into the base model and save the result.
# `model` here is the peft.PeftModel you finished training.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("production-model")
tokenizer.save_pretrained("production-model")  # vLLM expects the tokenizer alongside

# Serve with vLLM (example command):
# python -m vllm.entrypoints.openai.api_server \
#     --model production-model --port 8000
```
- Best for: High-volume production, single task
- Cost: $1-3/hour GPU rental (RunPod, Lambda)
- Latency: 50-200ms (fast; no network hop to an external API)
Option 2: Adapter Hot-Swapping
Keep adapters separate. Load the base model once, swap adapters per request:
```
Base model (loaded once, 3.5 GB in 4-bit)
├── /support → support-adapter.bin (50 MB)
├── /legal   → legal-adapter.bin (80 MB)
└── /code    → code-adapter.bin (60 MB)
```
Frameworks like LoRAX and vLLM support dynamic adapter loading. One GPU serves multiple specialized models.
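The tree above can be driven by a tiny routing layer. A minimal sketch, with the framework-specific calls shown as comments (adapter paths and names here are placeholders, not a real deployment):

```python
# Map request paths to adapter names (mirrors the tree above; names are illustrative).
ADAPTER_ROUTES = {
    "/support": "support",
    "/legal": "legal",
    "/code": "code",
}

def pick_adapter(path: str, default: str = "support") -> str:
    """Choose which LoRA adapter should handle this request."""
    return ADAPTER_ROUTES.get(path, default)

# With PEFT, the base model loads each adapter once at startup:
#   model = PeftModel.from_pretrained(base_model, "adapters/support", adapter_name="support")
#   model.load_adapter("adapters/legal", adapter_name="legal")
#   model.load_adapter("adapters/code", adapter_name="code")
# and each request activates the routed one:
#   model.set_adapter(pick_adapter(request_path))
```

LoRAX and vLLM handle this per-request instead of per-process, so concurrent requests can use different adapters on the same GPU.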
- Best for: Multi-task serving, A/B testing different adapters
- Cost: Same GPU, multiple models
- Latency: Slight overhead for adapter switching (~10-50ms)
Option 3: OpenAI’s Hosted Fine-Tune
If you fine-tuned via OpenAI’s API, deployment is automatic:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::abc123",
    messages=[{"role": "user", "content": "Your query"}],
)
```
- Best for: No infrastructure to manage, low-to-medium volume
- Cost: $0.30/$1.20 per million input/output tokens (GPT-4o-mini)
- Latency: 200-500ms (network + inference)
The Cost Calculation
Here’s where fine-tuning pays off — or doesn’t.
Scenario: Customer Support Chatbot (10,000 queries/day)
| Approach | Monthly Cost | Latency | Data Privacy |
|---|---|---|---|
| GPT-4o API | ~$750-1,500/mo | 300-500ms | Data sent to OpenAI |
| GPT-4o-mini API | ~$150-300/mo | 200-400ms | Data sent to OpenAI |
| Fine-tuned 7B (self-hosted) | ~$150-250/mo (GPU) | 50-150ms | Full control |
| Fine-tuned GPT-4o-mini (hosted) | ~$100-200/mo | 200-400ms | Data sent to OpenAI |
At 10,000 queries/day, a self-hosted fine-tuned model roughly matches GPT-4o-mini's API cost and runs 5-10x cheaper than GPT-4o. Below 1,000 queries/day, the GPU rental isn't worth it; use the API.
The break-even formula:
Break-even queries per month = GPU monthly cost / API cost per query
If your GPU costs $200/month and each API call costs $0.01, break-even is 20,000 queries/month (~667/day). Below that, stick with the API.
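As a sanity check, the formula is trivial to script (the GPU and per-query prices below are the example figures above, not universal numbers):

```python
def break_even_queries(gpu_monthly_usd: float, api_cost_per_query_usd: float) -> float:
    """Monthly query volume at which self-hosting matches API spend."""
    return gpu_monthly_usd / api_cost_per_query_usd

monthly = break_even_queries(200, 0.01)  # $200/month GPU vs $0.01 per API call
daily = monthly / 30                     # 20,000 queries/month, ~667/day
```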
The Hidden Savings
Cost-per-query isn’t the whole picture:
- Shorter prompts — Fine-tuned models don’t need long system prompts or few-shot examples. That’s 500-2,000 fewer input tokens per call.
- Smaller models — A fine-tuned 3B model can match an un-fine-tuned 70B. Smaller model = faster inference = more throughput per GPU.
- No rate limits — Self-hosted models don’t hit API rate limits during traffic spikes.
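The prompt-length saving alone is easy to estimate. A sketch using the 10,000 queries/day scenario; the token count and price are illustrative assumptions, so swap in your own prompt size and current API pricing:

```python
# Rough monthly savings from dropping a long system prompt + few-shot examples.
saved_tokens_per_call = 1_500      # assumed prompt overhead no longer needed
calls_per_month = 10_000 * 30      # the 10,000 queries/day scenario above
price_per_m_input = 0.15           # assumed USD per million input tokens

saved_usd = saved_tokens_per_call * calls_per_month / 1_000_000 * price_per_m_input
print(f"~${saved_usd:.2f}/month saved on input tokens alone")
```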
✅ Quick Check: Your company processes 500 customer emails per day through GPT-4o. Each costs ~$0.05 in API tokens. Is it worth fine-tuning a 7B model? (Probably not yet. 500 queries/day × $0.05 = $25/day = $750/month. A GPU costs $150-250/month — you’d save money. But factor in engineering time to build and maintain the pipeline. At 500/day, the OpenAI API with a fine-tuned GPT-4o-mini ($0.005/query = $75/month) might be the smarter move.)
Real-World SLM Use Cases
Fine-tuned small language models (3B-7B) are replacing much larger models in production. Here are proven patterns:
E-Commerce Customer Support
- Setup: Mistral 7B fine-tuned on 2,000 real support conversations
- Result: 90% cost reduction vs GPT-3.5 API, 75% of tickets handled autonomously
- Why it works: Support conversations follow predictable patterns. A specialist model handles them better than a generalist.
Code Review Pipeline
- Setup: Llama 3.2 3B fine-tuned on internal codebase and review comments
- Result: Runs locally, zero data leaves the company network
- Why it works: Privacy requirements prevent sending proprietary code to external APIs. A fine-tuned local model solves both privacy and cost.
Medical Document Extraction
- Setup: Phi-3 Mini (3.8B) fine-tuned on clinical notes
- Result: Processes thousands of records per hour on standard server hardware, HIPAA-compliant
- Why it works: Structured extraction from domain-specific documents is exactly what fine-tuning excels at; the format is consistent and the input patterns are predictable.
The SLM Market Context
Fine-tuned SLMs aren’t a niche anymore. The SLM market hit $6.5 billion in 2024 and is projected to reach $20.7 billion by 2030. Edge deployment — running models on phones, IoT devices, and on-premise servers — is the primary driver.
A fine-tuned Qwen3-4B now matches GPT-OSS-120B (a model 30x larger) on 7 out of 8 benchmarks. Specialist beats generalist.
Monitoring in Production
Deploying is step one. Keeping it running well is the ongoing work.
What to monitor:
| Metric | How to Measure | Alert Threshold |
|---|---|---|
| Output quality | LLM-as-judge on random sample (weekly) | Score drops >10% from baseline |
| Format compliance | Automated schema validation | Below 95% valid outputs |
| Latency | P50, P95, P99 response times | P95 > 500ms |
| Throughput | Requests per second | Drops below capacity plan |
| User feedback | Thumbs up/down, escalation rate | Satisfaction drops below baseline |
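Of these, format compliance is the easiest to automate. A minimal sketch that validates JSON outputs against an expected key set (the schema here is hypothetical; use your model's actual output contract):

```python
import json

REQUIRED_KEYS = {"intent", "answer"}  # hypothetical schema for this sketch

def is_valid(output: str) -> bool:
    """True if a model response parses as JSON with the expected keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()

# Compliance rate over a sample of logged outputs
outputs = ['{"intent": "refund", "answer": "Sure, here is how..."}', "not json at all"]
compliance = sum(is_valid(o) for o in outputs) / len(outputs)  # alert if below 0.95
```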
Retraining cadence: Most production models benefit from quarterly retraining. Collect new examples from production (especially cases where the model struggled), add them to the training set, retrain, evaluate against the previous version.
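One way to feed that cadence is to harvest low-scoring production interactions automatically. A sketch, assuming your logs carry an LLM-as-judge score and a human-corrected response (the field names are illustrative):

```python
def harvest(logs, threshold=0.6):
    """Turn interactions the model got wrong into chat-format training examples."""
    return [
        {
            "messages": [
                {"role": "user", "content": rec["prompt"]},
                {"role": "assistant", "content": rec["corrected_response"]},
            ]
        }
        for rec in logs
        if rec["judge_score"] < threshold and rec.get("corrected_response")
    ]

# Only the low-scoring, human-corrected interaction becomes a training example
logs = [
    {"prompt": "Refund policy?", "judge_score": 0.4, "corrected_response": "Refunds within 30 days..."},
    {"prompt": "Ship to EU?", "judge_score": 0.9, "corrected_response": "Yes, 5-7 business days."},
]
new_examples = harvest(logs)
```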
Course Review
Here’s what you’ve learned across 8 lessons:
| Lesson | Key Concept |
|---|---|
| 1. Decision Framework | Fine-tune for behavior/style, RAG for knowledge, prompt engineering first |
| 2. Methods | SFT for instruction-following, DPO for preferences, RLHF for frontier labs |
| 3. LoRA & QLoRA | 99.6% fewer parameters, 7B model in 8 GB VRAM, adapters are portable |
| 4. Tools | Unsloth for speed, Axolotl for production, OpenAI API for no-GPU |
| 5. Datasets | Quality > quantity, 1,000 curated > 10,000 noisy, always hold out a test set |
| 6. First Run | QLoRA + Unsloth + Colab T4 = free fine-tuning in 15 minutes |
| 7. Evaluation | Held-out metrics + LLM-as-judge + A/B comparisons, iterate 2-4 times |
| 8. Production | Merge adapters, self-host above 1,000 queries/day, monitor for drift |
Your Next Steps
- Pick a real task — Something you’d actually use. Customer support tone? Code review? Document extraction?
- Build 200-500 quality examples — This is where you’ll spend most of your time. Make them count.
- Run the Colab notebook — Lesson 6’s workflow. Swap in your data.
- Evaluate honestly — Use the framework from Lesson 7. Don’t ship until held-out metrics beat the baseline.
- Deploy small — Start with the OpenAI fine-tuning API or a single GPU. Scale up only when usage justifies it.
The gap between “I’ve never fine-tuned a model” and “I have a fine-tuned model in production” is smaller than you think. The hard part isn’t the training — it’s the data.
Key Takeaways
- Three deployment paths: merged model on GPU (lowest latency), adapter hot-swapping (multi-task), hosted API (no infrastructure)
- Self-hosted fine-tuned models break even at ~1,000 queries/day vs API pricing
- Fine-tuned 3B-7B models replace 70B+ base models in production — specialist beats generalist
- SLM market is growing 25%+ annually, driven by edge and on-premise deployment
- Monitor output quality, format compliance, and latency continuously — retrain quarterly with new production data
- The hardest part of fine-tuning isn’t the training code — it’s building a high-quality dataset