Production Deployment and Security
Harden your local AI for production: Docker containerization, network security, fine-tuning with LoRA, multi-user access, monitoring, and backup strategies.
Your local AI works on your laptop. Now make it work for your team — reliably, securely, and at scale.
🔄 Quick Recall: In the previous lesson, you learned the compliance frameworks (GDPR, HIPAA, EU AI Act) that govern AI deployments. Now you’ll implement the technical controls those frameworks require.
This lesson covers the production engineering: containerization, network security, fine-tuning, multi-user access, and monitoring.
Containerized Deployment with Docker
Running Ollama directly on a server works for personal use. For production, use Docker:
```yaml
# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    read_only: true
    security_opt:
      - no-new-privileges:true

volumes:
  ollama_data:
```
Why Docker for production:
- Isolation — the AI service can’t access your host filesystem
- Resource limits — control exactly how much CPU, RAM, and GPU the container uses
- Reproducibility — same container image deploys identically everywhere
- Restart policy — automatic recovery from crashes
- Security — read-only filesystem, no privilege escalation
After starting the container, load your model:
```bash
docker exec -it ollama ollama pull llama3.1
```
✅ Quick Check: Your Docker container running Ollama crashes at 2 AM. With the `restart: unless-stopped` policy, what happens? (Answer: Docker automatically restarts the container. The model data is preserved in the `ollama_data` volume, so no re-downloading is needed. The service comes back online without human intervention. This is why Docker is preferred over running Ollama directly — built-in crash recovery.)
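After a restart, a small readiness probe can confirm the service actually came back healthy. A minimal sketch in Python using only the standard library, hitting Ollama's `/api/tags` endpoint on its default port:

```python
import json
import urllib.request
import urllib.error

def ollama_ready(base_url: str = "http://localhost:11434", timeout: float = 3.0) -> bool:
    """Return True if the Ollama API answers and lists its models."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            data = json.load(resp)
            # A healthy instance returns a JSON object with a "models" list
            return isinstance(data.get("models"), list)
    except (urllib.error.URLError, ValueError, OSError):
        return False

# Usage: poll this after `docker compose up` until it returns True,
# then point your applications at the server.
```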
Network Security
Your local AI shouldn’t be accessible from the internet. Here’s how to lock it down:
Bind to Internal Network Only
```bash
# Instead of binding to 0.0.0.0 (all interfaces)
OLLAMA_HOST=192.168.1.100:11434 ollama serve
```
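A deployment script can assert that the configured bind address isn't a wildcard before starting the service. A quick sketch — the `check_bind` helper is illustrative, not part of Ollama:

```python
def check_bind(ollama_host: str) -> tuple[str, int, bool]:
    """Split an OLLAMA_HOST value into (ip, port, is_safe).

    is_safe is False for wildcard binds that would expose the API
    on every network interface.
    """
    ip, _, port = ollama_host.rpartition(":")
    wildcards = ("0.0.0.0", "::", "[::]")
    return ip, int(port), ip not in wildcards

print(check_bind("192.168.1.100:11434"))  # internal IP: safe
print(check_bind("0.0.0.0:11434"))        # wildcard: flagged
```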
Reverse Proxy with Authentication
Put nginx or Caddy in front of Ollama to add authentication:
```nginx
# nginx.conf
server {
    listen 443 ssl;
    server_name ai.internal.company.com;

    ssl_certificate     /etc/ssl/certs/ai.crt;
    ssl_certificate_key /etc/ssl/private/ai.key;

    location / {
        auth_basic           "AI Service";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://localhost:11434;
    }
}
```
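Clients then authenticate with standard HTTP Basic auth on every request. A minimal Python sketch using only the standard library — the hostname and credentials are placeholders for your own:

```python
import base64
import json
import urllib.request

def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header value nginx's auth_basic expects."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

def build_request(prompt: str,
                  url: str = "https://ai.internal.company.com/api/generate") -> urllib.request.Request:
    """Prepare an authenticated request to the proxied Ollama API."""
    body = json.dumps({"model": "llama3.1", "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode(),
        headers={
            "Authorization": basic_auth_header("jane", "s3cret"),
            "Content-Type": "application/json",
        },
    )
    # Send with urllib.request.urlopen(req) over your internal network
```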
Firewall Rules
```bash
# Allow only internal subnet
iptables -A INPUT -p tcp --dport 11434 -s 192.168.1.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 11434 -j DROP
```
Security checklist:
- AI service bound to internal IP, not 0.0.0.0
- HTTPS for all connections (even internal)
- Authentication required for API access
- Firewall restricts access to known IP ranges
- No direct internet access from the AI server
Fine-Tuning with LoRA
When RAG isn’t enough — your model needs to adopt domain-specific behavior, not just access domain-specific data — fine-tuning is the answer.
What LoRA Does
LoRA (Low-Rank Adaptation) freezes the original model and trains a small set of additional parameters. The result: a 50-200MB adapter file that modifies the model’s behavior without changing the multi-gigabyte base model.
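The size savings follow directly from the low-rank factorization: instead of updating a d×k weight matrix W, LoRA trains two small matrices B (d×r) and A (r×k) with rank r much smaller than d or k, so the update is W + BA. A back-of-the-envelope sketch (the layer shape is illustrative):

```python
def lora_params(d: int, k: int, r: int) -> tuple[int, int]:
    """Trainable parameters: full update vs. LoRA factorization W + B @ A."""
    full = d * k        # updating the entire weight matrix
    lora = r * (d + k)  # B is d x r, A is r x k
    return full, lora

# One 4096 x 4096 attention projection at rank 16:
full, lora = lora_params(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
# The LoRA update trains 128x fewer parameters for this layer
```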
When to Fine-Tune
- The model consistently uses wrong terminology for your domain
- You need a specific output format that prompting can’t reliably produce
- You want the model to adopt your company’s communication style
- RAG retrieves the right information but the model interprets it poorly
When NOT to Fine-Tune (Use RAG Instead)
- You need the model to know specific facts (use RAG)
- Your knowledge base changes frequently (use RAG)
- You need source attribution (RAG naturally provides citations)
QLoRA: Fine-Tuning on Consumer Hardware
QLoRA combines LoRA with 4-bit quantization, making fine-tuning possible on consumer GPUs:
| Hardware | Model Size | Technique |
|---|---|---|
| 8GB VRAM | Up to 7B | QLoRA |
| 12GB VRAM | Up to 13B | QLoRA |
| 24GB VRAM | Up to 34B | QLoRA |
| 48GB+ VRAM | 70B | LoRA or QLoRA |
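The table's limits can be sanity-checked with a rough rule of thumb: 4-bit weights take about half a byte per parameter, plus headroom for the adapter, optimizer state, and activations. A hedged estimate — the overhead figure is a ballpark, not a guarantee:

```python
def qlora_vram_gb(params_billion: float, overhead_gb: float = 2.0) -> float:
    """Very rough VRAM estimate for QLoRA: 0.5 bytes/param + fixed overhead."""
    weights_gb = params_billion * 1e9 * 0.5 / (1024 ** 3)
    return round(weights_gb + overhead_gb, 1)

for size in (7, 13, 34):
    print(f"{size}B model: ~{qlora_vram_gb(size)} GB VRAM")
```

The estimates land comfortably under the table's tiers, which is the point: quantized weights, not the adapter, dominate the budget.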
Basic QLoRA workflow:
1. Prepare training data (instruction-response pairs, typically 500-2000 examples)
2. Choose a base model (Llama 3.1 8B is a solid starting point)
3. Run fine-tuning with a library like `unsloth` or `axolotl`
4. Export the adapter to GGUF format
5. Load in Ollama with a custom Modelfile
```
# Modelfile for fine-tuned model
FROM llama3.1
ADAPTER ./my-lora-adapter.gguf
SYSTEM "You are a customer support agent for Acme Corp."
```
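The first step — preparing instruction-response pairs — usually means writing one JSON object per line. A minimal sketch; the field names follow the common Alpaca-style convention, but your fine-tuning library may expect different keys:

```python
import json

examples = [
    {"instruction": "Summarize this support ticket.",
     "input": "Customer cannot log in after password reset.",
     "output": "Login failure following password reset; escalate to auth team."},
    # ... typically 500-2000 examples in total
]

def write_jsonl(rows: list[dict], path: str) -> int:
    """Write one JSON object per line; return the number of rows written."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return len(rows)

write_jsonl(examples, "train.jsonl")
```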
✅ Quick Check: You fine-tuned a model on company data. An employee leaves the company and requests their data be deleted (GDPR right to erasure). Can you delete their data from the fine-tuned model? (Answer: This is one of the hardest problems in AI compliance. Unlike RAG — where you can delete documents from the vector database — fine-tuned knowledge is baked into model weights and can’t be selectively removed. Options: retrain without that person’s data, or document in your DPIA that fine-tuned models may retain residual knowledge. This is a strong argument for preferring RAG over fine-tuning when processing personal data.)
Multi-User Access Patterns
Pattern 1: Shared Server
One Ollama instance serves multiple users via the API. Simplest setup.
Employee laptops → Internal network → Ollama server → Response
Pros: one model loaded, shared GPU, central management.
Cons: concurrent requests queue up, no user isolation.
Pattern 2: Open WebUI
Deploy Open WebUI as a frontend for Ollama. Provides per-user accounts, conversation history, and admin controls.
```yaml
# Add to docker-compose.yml
open-webui:
  image: ghcr.io/open-webui/open-webui:main
  ports:
    - "3000:8080"
  environment:
    - OLLAMA_BASE_URL=http://ollama:11434
  volumes:
    - open-webui_data:/app/backend/data
```
Gives each user a ChatGPT-like interface with their own conversation history, while all processing stays on your server.
Pattern 3: API Gateway
For application integration, put an API gateway in front of Ollama to handle:
- Rate limiting per user/team
- Request logging for audit
- Input/output filtering
- API key management
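Per-key rate limiting — the first gateway responsibility above — can be as simple as a token bucket. A minimal in-process sketch; a production gateway or a Redis-backed limiter would replace this:

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One bucket per API key
buckets: dict[str, TokenBucket] = {}

def check_rate(api_key: str, rate: float = 2.0, capacity: int = 5) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(rate, capacity))
    return bucket.allow()
```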
Monitoring and Observability
Production AI needs monitoring:
What to Monitor
- Response latency — how long queries take (alert if > 10 seconds)
- Token throughput — tokens per second (tracks hardware utilization)
- Error rate — failed requests (hardware issues, OOM errors)
- Memory usage — RAM and VRAM utilization
- Disk space — model storage and vector database growth
- Query volume — requests per hour/day (capacity planning)
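The latency alert above (> 10 seconds) is more robust if it fires on a percentile of recent requests rather than a single slow one. A small sketch using a nearest-rank p95:

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for alerting."""
    ordered = sorted(samples)
    idx = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[idx]

def latency_alert(latencies_ms: list[float], threshold_ms: float = 10_000) -> bool:
    """Fire when the 95th percentile of recent latencies exceeds the threshold."""
    return percentile(latencies_ms, 95) > threshold_ms

samples = [1200] * 19 + [15_000]  # one slow outlier in 20 requests
print(latency_alert(samples))     # False: p95 ignores a single outlier
```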
Audit Logging
For compliance, log every interaction:
```json
{
  "timestamp": "2026-02-24T14:30:00Z",
  "user": "jane.doe@company.com",
  "model": "llama3.1",
  "prompt_length": 1250,
  "response_length": 430,
  "latency_ms": 2340,
  "status": "success"
}
```
Don’t log the actual prompt/response content unless your policy explicitly allows it — that creates a sensitive data store that itself needs protection.
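A logging helper can enforce that rule structurally: record lengths and metadata, never the text itself. A minimal sketch matching the record format above:

```python
import json
from datetime import datetime, timezone

def audit_record(user: str, model: str, prompt: str, response: str,
                 latency_ms: int, status: str = "success") -> str:
    """Build an audit log line that never includes prompt/response content."""
    record = {
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user": user,
        "model": model,
        "prompt_length": len(prompt),      # length only, not the text
        "response_length": len(response),  # length only, not the text
        "latency_ms": latency_ms,
        "status": status,
    }
    return json.dumps(record)

line = audit_record("jane.doe@company.com", "llama3.1",
                    "confidential prompt text", "model reply", 2340)
```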
Backup and Recovery
What to Back Up
- Model files — large but can be re-downloaded (lower priority)
- LoRA adapters — small and can’t be re-downloaded (high priority)
- Vector database — your indexed documents (high priority)
- Configuration — Modelfiles, docker-compose, nginx configs (critical)
- Audit logs — compliance requirement (critical)
Backup Strategy
```bash
# Daily: configuration and adapters
tar -czf backup-config-$(date +%Y%m%d).tar.gz \
    docker-compose.yml Modelfile nginx.conf

# Weekly: vector database
tar -czf backup-vectordb-$(date +%Y%m%d).tar.gz ./chromadb/

# Monthly: full snapshot including models
```
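A backup you never open is a guess. A small sketch that verifies an archive can actually be read back and contains the files you rely on (the paths are illustrative):

```python
import tarfile

def verify_backup(archive_path: str, required: list[str]) -> bool:
    """Open the archive and confirm the required files are inside it."""
    try:
        with tarfile.open(archive_path, "r:gz") as tar:
            names = tar.getnames()
        return all(any(name.endswith(req) for name in names) for req in required)
    except (tarfile.TarError, OSError):
        return False

# e.g. verify_backup("backup-config-20260224.tar.gz",
#                    ["docker-compose.yml", "Modelfile", "nginx.conf"])
```

Run this right after each backup job, and alert on failure the same way you alert on latency.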
Practice Exercise
1. Create a `docker-compose.yml` for your local AI setup (Ollama + Open WebUI)
2. Configure the network to bind only to your internal IP
3. Set up basic authentication using nginx or Open WebUI's user system
4. Create a monitoring checklist: what metrics will you track?
5. Write a one-page backup plan for your AI stack
Key Takeaways
- Docker containers provide isolation, resource limits, and automatic restart for production deployments
- Lock down network access: bind to internal IPs, use HTTPS, require authentication, configure firewalls
- LoRA/QLoRA enables fine-tuning on consumer GPUs — but prefer RAG for personal data (easier to delete)
- Open WebUI adds multi-user access with per-user accounts and conversation history
- Monitor latency, throughput, errors, and memory — log queries for audit but consider what you’re storing
- Back up adapters, vector databases, and configurations — models can be re-downloaded
Up Next
In the final lesson, you’ll combine everything into a complete private AI stack — from model selection through RAG, compliance, and production deployment — and get a roadmap for continued learning.