Updated May 15, 2026 — Six weeks after the March 26 launch wave, the open-source TTS field has expanded significantly. Chatterbox-Turbo (Resemble AI, MIT license) and CosyVoice2-0.5B are now serious top-three contenders in r/LocalLLaMA community discussions, and Kokoro, at just 82M parameters, has become the dominant choice for edge and efficiency use. Resemble AI's blind test reports that listeners preferred Chatterbox-Turbo over ElevenLabs 65.3% to 24.5% — the first vendor-run blind test in which an open-source model beat ElevenLabs by a wide margin. See the new “New Entrants (April-May 2026)” section below.
March 26, 2026, was a landmark day for open-source voice AI. Three models dropped within hours of each other: Mistral’s Voxtral TTS, Cohere’s Transcribe, and Tencent’s CoVo-Audio. One Redditor on r/LocalLLaMA summed it up: “The on-prem voice stack is here.”
That stack matters because it changes the economics. ElevenLabs charges $5-$1,300/month. These models are free to download and run on your own hardware. Some run on a smartphone.
Here are the 5 best open-source TTS models you can actually self-host right now, ranked by quality.
Quick Comparison
| Model | Quality Rank | Languages | Min Hardware | Voice Cloning | License |
|---|---|---|---|---|---|
| Voxtral TTS | #1 | 9 | 3GB RAM / 16GB GPU | 3 seconds | CC (open) |
| Bark | #2 | 13+ | 12GB GPU | No | MIT |
| Coqui XTTS v2 | #3 | 17 | 8GB GPU | 6 seconds | MPL 2.0 |
| Piper | #4 | 30+ | CPU only | No | MIT |
| MetaVoice | #5 | 1 (EN) | 8GB GPU | 30 seconds | Apache 2.0 |
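A quick way to sanity-check the hardware column is the standard rule of thumb that FP16 weights take roughly two bytes per parameter (actual VRAM use is higher once activations and framework overhead are added). A minimal sketch, using the parameter counts quoted in this article:

```python
def fp16_weight_gb(params: float) -> float:
    """Approximate on-disk/loaded weight size at FP16: 2 bytes per parameter."""
    return params * 2 / 1e9

# Parameter counts as quoted in this article.
models = {
    "Voxtral TTS": 4e9,
    "Chatterbox-Turbo": 350e6,
    "CosyVoice2-0.5B": 0.5e9,
    "Kokoro": 82e6,
}
for name, p in models.items():
    print(f"{name}: ~{fp16_weight_gb(p):.2f} GB of weights at FP16")
# Voxtral lands at ~8 GB of weights alone, which is why a 16GB GPU is the
# production recommendation; Kokoro's ~0.16 GB is why it fits on a Raspberry-Pi-class budget.
```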
1. Voxtral TTS (Mistral) — The New Benchmark
Why it’s #1: In blind human evaluations, 62.8% of listeners preferred Voxtral over ElevenLabs Flash v2.5. It matches ElevenLabs’ premium v3 tier on emotional expressiveness while being completely free.
- Parameters: 4B
- Languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, Arabic
- Voice cloning: 3 seconds of reference audio
- Latency: 90ms time-to-first-audio
- Self-hosting: 3GB RAM minimum, 16GB GPU recommended for production
- API option: $0.016/1K characters via Mistral’s API
- License: Creative Commons
Best for: Production-quality voice in European languages. The 3-second voice cloning is the fastest in this list. If you need to spin up a custom voice quickly, nothing else comes close.
Limitations: Only 9 languages (no CJK). Launched in late March 2026, so the ecosystem is still thin. No dubbing, no translation pipeline.
For a detailed comparison with ElevenLabs, see our Voxtral TTS vs ElevenLabs guide.
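If you'd rather use Mistral's hosted API than self-host, the $0.016/1K-characters rate quoted above makes per-project costs easy to estimate. A small arithmetic sketch (the word-to-character ratio is a rough assumption, about six characters per English word):

```python
RATE_PER_1K_CHARS = 0.016  # Mistral API price quoted above, USD

def voxtral_api_cost(chars: int) -> float:
    """Estimated USD cost to synthesize `chars` characters via the API."""
    return chars / 1000 * RATE_PER_1K_CHARS

# A 100,000-word audiobook is roughly 600,000 characters.
print(f"Audiobook (~600K chars): ${voxtral_api_cost(600_000):.2f}")  # → $9.60
print(f"Blog post (5K chars):    ${voxtral_api_cost(5_000):.2f}")    # → $0.08
```

At under $10 for a full audiobook, even the paid route undercuts subscription pricing for occasional use; self-hosting only wins once volume is sustained.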
2. Bark (Suno) — Most Expressive
Why it’s #2: Bark generates not just speech but laughter, sighs, music, and environmental sounds. It’s the most “human-sounding” model for emotional content — narration, audiobooks, character voices.
- Developer: Suno (yes, the AI music company)
- Languages: 13+ (including Chinese, Japanese, Korean)
- Voice cloning: Not built-in (community workarounds exist)
- Quality: High naturalness, sometimes unpredictable
- Self-hosting: 12GB GPU recommended
- License: MIT
Best for: Audiobook narration, creative content, anything where emotional range matters. Bark can laugh in the middle of a sentence. Try that with any other model.
Limitations: Slow generation speed. No streaming support. Can produce unexpected sounds or pauses. Not ideal for real-time applications.
3. Coqui XTTS v2 — Best Multilingual
Why it’s #3: 17 languages with solid quality across all of them. Voice cloning from 6 seconds. The most battle-tested open-source TTS for production multilingual applications.
- Languages: 17 (including Chinese, Japanese, Korean, Arabic, Turkish, Russian, Polish)
- Voice cloning: 6 seconds of reference audio
- Quality: Good to very good, consistent across languages
- Self-hosting: 8GB GPU
- License: MPL 2.0 (commercial use allowed with conditions)
Best for: Multilingual applications. If you need Chinese, Japanese, Korean, Turkish, or Russian TTS with voice cloning, Coqui is currently the best open-source option. Voxtral doesn’t cover these languages yet.
Limitations: Coqui (the company) shut down in 2024, so the model isn’t actively maintained. Community forks exist, but don’t expect new features. Quality trails Voxtral and Bark on English.
4. Piper — Lightest, Fastest, Most Languages
Why it’s #4: Runs on a Raspberry Pi. Supports 30+ languages. Generates speech in real-time on CPU alone. If you need TTS on edge devices or embedded systems, Piper is the answer.
- Languages: 30+ (widest coverage in this list)
- Voice cloning: No
- Quality: Good (not great — optimized for speed over quality)
- Self-hosting: CPU only — no GPU needed. Runs on Raspberry Pi, Android, embedded Linux
- License: MIT
Best for: IoT devices, home automation, accessibility tools, low-power applications. When you need TTS that runs everywhere, including a $35 computer.
Limitations: No voice cloning. Quality is noticeably below the top three — sounds more “synthesized” and less natural. Pre-built voices only.
5. MetaVoice — Best English-Only Quality
Why it’s #5: Very high English quality with emotional tone control. Zero-shot voice cloning from 30 seconds. If your use case is English-only and quality is everything, MetaVoice deserves a look.
- Developer: MetaVoice (acquired by ElevenLabs in 2024)
- Languages: English only
- Voice cloning: 30 seconds
- Quality: Excellent for English
- Self-hosting: 8GB GPU
- License: Apache 2.0
Best for: English podcasts, voiceovers, narration where you want maximum quality and don’t need other languages.
Limitations: English only. The 30-second cloning requirement is much higher than Voxtral (3 seconds) or Coqui (6 seconds). MetaVoice was acquired by ElevenLabs, so future open-source development is uncertain.
New Entrants (April-May 2026): The Models the Community Rallied Around
Six weeks after launch day, four newer models have joined the conversation — and based on r/LocalLLaMA threads through May, three of them now belong in any serious open-source TTS shortlist.
Chatterbox-Turbo (Resemble AI) — The Cloning + Quality Pick
The model that most consistently wins community “best overall” votes. Notable for actually beating ElevenLabs in a vendor-run blind test.
- Parameters: 350M (Turbo); ~500M (full Chatterbox)
- Languages: 23+ (via MTL training); strongest in English
- Voice cloning: Zero-shot from ~5-10 seconds of reference audio
- Latency: ~75 ms, 6× real-time on a single consumer GPU
- Min hardware: Single consumer GPU (RTX 30/40-series comfortable, 4-16 GB VRAM)
- License: MIT — most permissive on this list
- Standout feature: Paralinguistic tags ([laugh]), pacing controls, emotion control
- GitHub: resemble-ai/chatterbox
- The blind-test result: 65.3% of listeners preferred Chatterbox-Turbo, 24.5% preferred ElevenLabs, 10.2% neutral — per Resemble AI’s published listening study. Vendor-run, so take with the usual grain of salt, but it’s the most striking open-vs-closed result of 2026 so far.
CosyVoice2-0.5B (FunAudioLLM) — The Streaming/Multilingual Pick
The best choice if you need low-latency real-time output in 9+ languages.
- Parameters: 500M (0.5B)
- Languages: 9 base (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) + 18+ Chinese dialects. EU community fork extends FR/DE.
- Voice cloning: Multilingual zero-shot, including cross-lingual (clone in English, speak Japanese)
- Latency: ~150 ms streaming first-packet — class-leading
- Min hardware: ~8 GB VRAM (runs on Jetson-class edge devices)
- License: Custom (commercial use generally OK with attribution; verify the LICENSE file before production)
- Pronunciation: 30-50% fewer pronunciation errors vs CosyVoice1
- Hugging Face: FunAudioLLM/CosyVoice2-0.5B
Kokoro (82M) — The Efficiency King
The default “just give me TTS that works on a Mac” pick. Frequently called “no competition” for speed/edge use in community threads.
- Parameters: 82M
- Languages: 8 (English + Japanese, Chinese, several European)
- Voices: 54 presets across the language set (no arbitrary voice cloning — uses preset voices)
- Latency: Blazing — community reports 90-210× real-time on decent hardware
- Min hardware: ~2-3 GB VRAM at FP16; weights under 1 GB; runs on CPU and Mac M-series perfectly well
- License: Open (check repo for current terms)
- GitHub: hexgrad/kokoro
- The catch: No zero-shot cloning. If you need a custom voice, look at Chatterbox-Turbo, CosyVoice2, or stick with the existing Voxtral / XTTS picks above.
Fish Speech V1.5 and IndexTTS-2 (worth knowing about)
Two more entrants worth flagging without full deep-dives:
- Fish Speech V1.5 — Large DualAR Transformer, English/Chinese/Japanese-focused, TTS Arena Elo ~1339 (top tier). HF: fishaudio/fish-speech-1.5.
- IndexTTS-2 / IndexTTS-2.5 — English-centric, zero-shot, strong emotion control with disentangled speaker identity (you can mix one speaker’s timbre with another’s emotion). License is more restrictive (non-commercial without contact). GitHub.
Community Use-Case Cheat Sheet (r/LocalLLaMA May 2026 consensus)
| Use case | Top pick | Why |
|---|---|---|
| Speed / low VRAM / edge / Mac | Kokoro | 90-210× real-time, CPU-viable, <1GB weights |
| Voice cloning / emotion / quality | Chatterbox-Turbo | Beats ElevenLabs in blind test, MIT, 5-10s ref |
| Real-time streaming | CosyVoice2 or Kokoro | CosyVoice2 hits ~150ms first-packet; Kokoro is fast enough to synthesize chunk-by-chunk |
| Multilingual breadth | Voxtral, Chatterbox, or CosyVoice2 | 9-23+ languages |
| Long-form / production stability | Chatterbox or MOSS-TTS | Stability + paralinguistic control |
| Free with permissive license | Kokoro, Chatterbox (MIT), Piper (MIT) | Ship in commercial products |
The dominant pattern in May 2026 community threads is stacking, not picking: Kokoro for fast pipeline outputs, Chatterbox-Turbo when premium quality matters, Voxtral when you need 9-language emotional expressiveness with the polish of a frontier model. The “one TTS model rules them all” era is clearly over.
How to Choose
| If you need… | Use this |
|---|---|
| Best overall quality + voice cloning | Voxtral TTS |
| CJK language support + voice cloning | Coqui XTTS v2 |
| Emotional narration + sound effects | Bark |
| Edge devices / no GPU / 30+ languages | Piper |
| Best English-only quality | MetaVoice |
| Privacy / data sovereignty (never leaves your network) | Any of the above (all self-hostable) |
The Elephant in the Room: ElevenLabs
All five of these models are free. ElevenLabs starts at $5/month and scales to $1,300+. So why would anyone pay?
ElevenLabs still leads on:
- 32 languages (vs Voxtral’s 9 or Coqui’s 17)
- Professional dubbing and translation pipelines
- Thousands of pre-made voices
- Enterprise SLA and support
- Integrations with Descript, Canva, and dozens of tools
- The most polished UX — no command line needed
But the gap is closing fast. Voxtral beat ElevenLabs’ Flash tier in blind tests in March, and Chatterbox-Turbo (Resemble AI) released a follow-up blind test where 65.3% of listeners preferred it over ElevenLabs vs 24.5% the other way. That’s the first vendor-run blind result where an open-weight, MIT-licensed model beat ElevenLabs by a substantial margin. The privacy angle (self-hosted = no data leaves your network) is a genuine enterprise requirement, not a nice-to-have. And for developers building voice features into products, “free model weights” vs “$99-$1,300/month API” is a simple calculation.
The on-premise voice AI stack is real. March 26 proved it. May 7 (when Chatterbox-Turbo’s blind-test results circulated) cemented it.
Sources:
Original 5 (March 2026):
- Speaking of Voxtral — Mistral AI Blog
- Voxtral TTS Model — Hugging Face
- Bark — Suno GitHub
- Coqui XTTS — GitHub
- Piper — rhasspy GitHub
- MetaVoice — GitHub
- Voxtral TTS vs ElevenLabs — FindSkill.ai
Added in May 2026 update:
- Chatterbox-Turbo — Resemble AI
- Chatterbox GitHub
- CosyVoice2-0.5B — Hugging Face
- Kokoro-82M — GitHub
- Fish Speech v1.5 — Hugging Face
- IndexTTS — GitHub
- Speech AI Leaderboard 2026 — CodeSOTA
- Mistral releases Voxtral — TechCrunch
- Awesome AI Voice — wildminder GitHub
- Best Open Source TTS Models — Modal blog
- r/LocalLLaMA discussions on Kokoro, Chatterbox, CosyVoice