Capstone: Production Deployment & Use Cases
Deploy your fine-tuned model to production: adapter merging, serving options, cost analysis, monitoring, and real-world SLM use cases.
You’ve built a fine-tuned model. It passes evaluation. Now the question every tutorial avoids: how do you actually deploy it? And more importantly: when does the math justify running your own model instead of calling an API?
This lesson covers the production side: deployment options, cost analysis, monitoring, and real-world use cases where fine-tuned SLMs replace much larger models.
Deployment Options
You have three paths from trained model to production:
Option 1: Merged Model on a GPU Server
Merge LoRA adapters into the base model, then serve with vLLM or TGI:
```python
# Merge the LoRA weights into the base model and save the result.
# `model` here is the peft.PeftModel you finished training.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("production-model")
tokenizer.save_pretrained("production-model")  # vLLM expects the tokenizer alongside

# Serve with vLLM (example command):
# python -m vllm.entrypoints.openai.api_server \
#     --model production-model --port 8000
```
- Best for: High-volume production, single task
- Cost: $1-3/hour GPU rental (RunPod, Lambda)
- Latency: 50-200ms (fast; no network hop to an external API)
Option 2: Adapter Hot-Swapping
Keep adapters separate. Load the base model once, swap adapters per request:
```
Base model (loaded once, 3.5 GB in 4-bit)
├── /support → support-adapter.bin (50 MB)
├── /legal   → legal-adapter.bin (80 MB)
└── /code    → code-adapter.bin (60 MB)
```
Frameworks like LoRAX and vLLM support dynamic adapter loading. One GPU serves multiple specialized models.
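The tree above can be driven by a tiny routing layer. A minimal sketch, with the framework-specific calls shown as comments (adapter paths and names here are placeholders, not a real deployment):

```python
# Map request paths to adapter names (mirrors the tree above; names are illustrative).
ADAPTER_ROUTES = {
    "/support": "support",
    "/legal": "legal",
    "/code": "code",
}

def pick_adapter(path: str, default: str = "support") -> str:
    """Choose which LoRA adapter should handle this request."""
    return ADAPTER_ROUTES.get(path, default)

# With PEFT, the base model loads each adapter once at startup:
#   model = PeftModel.from_pretrained(base_model, "adapters/support", adapter_name="support")
#   model.load_adapter("adapters/legal", adapter_name="legal")
#   model.load_adapter("adapters/code", adapter_name="code")
# and each request activates the routed one:
#   model.set_adapter(pick_adapter(request_path))
```

LoRAX and vLLM handle this per-request instead of per-process, so concurrent requests can use different adapters on the same GPU.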
- Best for: Multi-task serving, A/B testing different adapters
- Cost: Same GPU, multiple models
- Latency: Slight overhead for adapter switching (~10-50ms)
Option 3: OpenAI’s Hosted Fine-Tune
If you fine-tuned via OpenAI’s API, deployment is automatic:
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:your-org::abc123",
    messages=[{"role": "user", "content": "Your query"}],
)
```
- Best for: No infrastructure to manage, low-to-medium volume
- Cost: $0.30/$1.20 per million input/output tokens (GPT-4o-mini)
- Latency: 200-500ms (network + inference)
The Cost Calculation
Here’s where fine-tuning pays off — or doesn’t.
Scenario: Customer Support Chatbot (10,000 queries/day)
| Approach | Monthly Cost | Latency | Data Privacy |
|---|---|---|---|
| GPT-4o API | ~$750-1,500/mo | 300-500ms | Data sent to OpenAI |
| GPT-4o-mini API | ~$150-300/mo | 200-400ms | Data sent to OpenAI |
| Fine-tuned 7B (self-hosted) | ~$150-250/mo (GPU) | 50-150ms | Full control |
| Fine-tuned GPT-4o-mini (hosted) | ~$100-200/mo | 200-400ms | Data sent to OpenAI |
At 10,000 queries/day, a self-hosted fine-tuned model roughly matches GPT-4o-mini's API cost and runs 5-10x cheaper than GPT-4o. Below 1,000 queries/day, the GPU rental isn't worth it; use the API.
The break-even formula:
Break-even queries per month = GPU monthly cost / API cost per query
If your GPU costs $200/month and each API call costs $0.01, break-even is 20,000 queries/month (~667/day). Below that, stick with the API.
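As a sanity check, the formula is trivial to script (the GPU and per-query prices below are the example figures above, not universal numbers):

```python
def break_even_queries(gpu_monthly_usd: float, api_cost_per_query_usd: float) -> float:
    """Monthly query volume at which self-hosting matches API spend."""
    return gpu_monthly_usd / api_cost_per_query_usd

monthly = break_even_queries(200, 0.01)  # $200/month GPU vs $0.01 per API call
daily = monthly / 30                     # 20,000 queries/month, ~667/day
```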
The Hidden Savings
Cost-per-query isn’t the whole picture:
- Shorter prompts — Fine-tuned models don’t need long system prompts or few-shot examples. That’s 500-2,000 fewer input tokens per call.
- Smaller models — A fine-tuned 3B model can match an un-fine-tuned 70B. Smaller model = faster inference = more throughput per GPU.
- No rate limits — Self-hosted models don’t hit API rate limits during traffic spikes.
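The prompt-length saving alone is easy to estimate. A sketch using the 10,000 queries/day scenario; the token count and price are illustrative assumptions, so swap in your own prompt size and current API pricing:

```python
# Rough monthly savings from dropping a long system prompt + few-shot examples.
saved_tokens_per_call = 1_500      # assumed prompt overhead no longer needed
calls_per_month = 10_000 * 30      # the 10,000 queries/day scenario above
price_per_m_input = 0.15           # assumed USD per million input tokens

saved_usd = saved_tokens_per_call * calls_per_month / 1_000_000 * price_per_m_input
print(f"~${saved_usd:.2f}/month saved on input tokens alone")
```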
✅ Quick Check: Your company processes 500 customer emails per day through GPT-4o. Each costs ~$0.05 in API tokens. Is it worth fine-tuning a 7B model? (Probably not yet. 500 queries/day × $0.05 = $25/day = $750/month. A GPU costs $150-250/month — you’d save money. But factor in engineering time to build and maintain the pipeline. At 500/day, the OpenAI API with a fine-tuned GPT-4o-mini ($0.005/query = $75/month) might be the smarter move.)
Real-World SLM Use Cases
Fine-tuned small language models (3B-7B) are replacing much larger models in production. Here are proven patterns:
E-Commerce Customer Support
- Setup: Mistral 7B fine-tuned on 2,000 real support conversations
- Result: 90% cost reduction vs GPT-3.5 API, 75% of tickets handled autonomously
- Why it works: Support conversations follow predictable patterns. A specialist model handles them better than a generalist.
Code Review Pipeline
- Setup: Llama 3.2 3B fine-tuned on internal codebase and review comments
- Result: Runs locally, zero data leaves the company network
- Why it works: Privacy requirements prevent sending proprietary code to external APIs. A fine-tuned local model solves both privacy and cost.
Medical Document Extraction
- Setup: Phi-3 Mini (3.8B) fine-tuned on clinical notes
- Result: Processes thousands of records per hour on standard server hardware, HIPAA-compliant
- Why it works: Structured extraction from domain-specific documents is exactly what fine-tuning excels at; the format is consistent and the input patterns are predictable.
The SLM Market Context
Fine-tuned SLMs aren’t a niche anymore. The SLM market hit $6.5 billion in 2024 and is projected to reach $20.7 billion by 2030. Edge deployment — running models on phones, IoT devices, and on-premise servers — is the primary driver.
A fine-tuned Qwen3-4B now matches GPT-OSS-120B (a model 30x larger) on 7 out of 8 benchmarks. Specialist beats generalist.
Monitoring in Production
Deploying is step one. Keeping it running well is the ongoing work.
What to monitor:
| Metric | How to Measure | Alert Threshold |
|---|---|---|
| Output quality | LLM-as-judge on random sample (weekly) | Score drops >10% from baseline |
| Format compliance | Automated schema validation | Below 95% valid outputs |
| Latency | P50, P95, P99 response times | P95 > 500ms |
| Throughput | Requests per second | Drops below capacity plan |
| User feedback | Thumbs up/down, escalation rate | Satisfaction drops below baseline |
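Of these, format compliance is the easiest to automate. A minimal sketch that validates JSON outputs against an expected key set (the schema here is hypothetical; use your model's actual output contract):

```python
import json

REQUIRED_KEYS = {"intent", "answer"}  # hypothetical schema for this sketch

def is_valid(output: str) -> bool:
    """True if a model response parses as JSON with the expected keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()

# Compliance rate over a sample of logged outputs
outputs = ['{"intent": "refund", "answer": "Sure, here is how..."}', "not json at all"]
compliance = sum(is_valid(o) for o in outputs) / len(outputs)  # alert if below 0.95
```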
Retraining cadence: Most production models benefit from quarterly retraining. Collect new examples from production (especially cases where the model struggled), add them to the training set, retrain, evaluate against the previous version.
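One way to feed that cadence is to harvest low-scoring production interactions automatically. A sketch, assuming your logs carry an LLM-as-judge score and a human-corrected response (the field names are illustrative):

```python
def harvest(logs, threshold=0.6):
    """Turn interactions the model got wrong into chat-format training examples."""
    return [
        {
            "messages": [
                {"role": "user", "content": rec["prompt"]},
                {"role": "assistant", "content": rec["corrected_response"]},
            ]
        }
        for rec in logs
        if rec["judge_score"] < threshold and rec.get("corrected_response")
    ]

# Only the low-scoring, human-corrected interaction becomes a training example
logs = [
    {"prompt": "Refund policy?", "judge_score": 0.4, "corrected_response": "Refunds within 30 days..."},
    {"prompt": "Ship to EU?", "judge_score": 0.9, "corrected_response": "Yes, 5-7 business days."},
]
new_examples = harvest(logs)
```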
Course Review
Here’s what you’ve learned across 8 lessons:
| Lesson | Key Concept |
|---|---|
| 1. Decision Framework | Fine-tune for behavior/style, RAG for knowledge, prompt engineering first |
| 2. Methods | SFT for instruction-following, DPO for preferences, RLHF for frontier labs |
| 3. LoRA & QLoRA | 99.6% fewer parameters, 7B model in 8 GB VRAM, adapters are portable |
| 4. Tools | Unsloth for speed, Axolotl for production, OpenAI API for no-GPU |
| 5. Datasets | Quality > quantity, 1,000 curated > 10,000 noisy, always hold out a test set |
| 6. First Run | QLoRA + Unsloth + Colab T4 = free fine-tuning in 15 minutes |
| 7. Evaluation | Held-out metrics + LLM-as-judge + A/B comparisons, iterate 2-4 times |
| 8. Production | Merge adapters, self-host above 1,000 queries/day, monitor for drift |
Your Next Steps
- Pick a real task — Something you’d actually use. Customer support tone? Code review? Document extraction?
- Build 200-500 quality examples — This is where you’ll spend most of your time. Make them count.
- Run the Colab notebook — Lesson 6’s workflow. Swap in your data.
- Evaluate honestly — Use the framework from Lesson 7. Don’t ship until held-out metrics beat the baseline.
- Deploy small — Start with the OpenAI fine-tuning API or a single GPU. Scale up only when usage justifies it.
The gap between “I’ve never fine-tuned a model” and “I have a fine-tuned model in production” is smaller than you think. The hard part isn’t the training — it’s the data.
Key Takeaways
- Three deployment paths: merged model on GPU (lowest latency), adapter hot-swapping (multi-task), hosted API (no infrastructure)
- Self-hosted fine-tuned models break even at ~1,000 queries/day vs API pricing
- Fine-tuned 3B-7B models replace 70B+ base models in production — specialist beats generalist
- SLM market is growing 25%+ annually, driven by edge and on-premise deployment
- Monitor output quality, format compliance, and latency continuously — retrain quarterly with new production data
- The hardest part of fine-tuning isn’t the training code — it’s building a high-quality dataset