3-Second Voice Cloning • 97ms Latency • 10 Languages
Qwen3-TTS is an advanced open-source text-to-speech model series developed by the Qwen team at Alibaba Cloud. Trained on over 5 million hours of speech data, Qwen3-TTS delivers state-of-the-art voice cloning, natural language voice design, and ultra-low latency streaming for real-time applications.
Qwen3-TTS combines cutting-edge architecture with practical features for production-ready speech synthesis.
- **Zero-shot voice cloning:** Clone any voice from just 3 seconds of reference audio using state-of-the-art zero-shot technology. Capture unique vocal characteristics with unprecedented accuracy.
- **Description-based control:** Create entirely new voices using natural language descriptions. Control timbre, emotion, accent, and speaking style through intuitive text prompts.
- **97ms first packet:** Achieve 97ms first-packet latency with the 0.6B model, enabling real-time streaming for conversational AI and interactive applications.
- **10 languages:** Generate natural speech across 10 major languages, including Chinese, English, Japanese, Korean, German, French, Spanish, and more.
- **0.6B & 1.7B variants:** Choose between the 1.7B model for maximum quality and expressiveness, or the 0.6B model for efficiency and real-time streaming applications.
- **Apache 2.0:** Released under the Apache 2.0 license, enabling unrestricted commercial use, modification, and self-hosted deployment without licensing fees.

Built on a dual-track language model architecture with innovative speech tokenizers for optimal performance.
| Specification | Value |
|---|---|
| Architecture | Dual-Track Language Model |
| Speech Tokenizer | Qwen-TTS-Tokenizer-12Hz |
| Training Data | 5 Million+ Hours |
| Languages | 10 Major Languages |
| First-Packet Latency | 97ms (0.6B Model) |
| License | Apache 2.0 |
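The "12Hz" in the tokenizer name indicates its speech token rate. As a back-of-envelope check (our own arithmetic, not an official figure from the Qwen team), each token then spans about 83 ms of audio, which is consistent with the ~97–101 ms first-packet latencies reported for the models:

```python
# Qwen-TTS-Tokenizer-12Hz emits 12 speech tokens per second,
# so each token spans 1000 / 12 ≈ 83.3 ms of audio.
token_rate_hz = 12
ms_per_token = 1000 / token_rate_hz
print(f"{ms_per_token:.1f} ms of audio per speech token")
```

In other words, first audio can be emitted after roughly one token's worth of generation, which is why the first-packet latency sits just above this per-token duration.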
| Model | Parameters | Primary Use Case | First-Packet Latency |
|---|---|---|---|
| Qwen3-TTS-12Hz-1.7B-Base | 1.7 Billion | High-quality synthesis | 101ms |
| Qwen3-TTS-12Hz-0.6B-Base | 0.6 Billion | Low-latency streaming | 97ms |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7 Billion | Voice design & cloning | 101ms |
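As a quick illustration of the trade-off in the table, a minimal checkpoint-selection helper (the function and its thresholds are illustrative, not part of the official API; checkpoint IDs follow the `Qwen/` Hugging Face naming used in the quick start below):

```python
def pick_model(max_latency_ms: float, need_voice_design: bool = False) -> str:
    """Pick a Qwen3-TTS checkpoint based on latency budget and use case.

    Illustrative only: encodes the latency/quality trade-off from the
    model table, not an official selection API.
    """
    if need_voice_design:
        # Voice design and cloning use the dedicated VoiceDesign checkpoint.
        return "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign"
    # The 0.6B model reaches 97ms first-packet latency; the 1.7B model, 101ms.
    if max_latency_ms < 101:
        return "Qwen/Qwen3-TTS-12Hz-0.6B-Base"
    return "Qwen/Qwen3-TTS-12Hz-1.7B-Base"

print(pick_model(100))        # tight latency budget -> 0.6B streaming model
print(pick_model(150))        # relaxed budget -> 1.7B quality model
print(pick_model(150, True))  # voice design -> VoiceDesign checkpoint
```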
Qwen3-TTS outperforms leading commercial and open-source models in key metrics.
Qwen3-TTS enables innovative solutions across industries with its advanced speech synthesis capabilities.
Power conversational AI with ultra-low latency responses. The 97ms streaming enables natural, real-time dialogue for voice assistants, customer service bots, and smart home devices.
Generate professional narration for audiobooks, podcasts, video voiceovers, and documentaries. Create unique character voices for audio dramas and animated content.
Improve screen reader experiences with natural, engaging speech. Provide multilingual audio for educational materials and support learners with personalized voices.
Generate dynamic NPC dialogue in real-time. Create interactive storytelling experiences with context-aware speech that adapts to player actions and game states.
Get started with Qwen3-TTS in minutes using the official Python package.
```shell
# Install the qwen-tts package
pip install -U qwen-tts
```

```python
# Basic text-to-speech generation
from qwen_tts import Qwen3TTSModel
import torch
import soundfile as sf

# Load the model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Generate speech
text = "Welcome to Qwen3-TTS, an advanced text-to-speech system."
wavs, sr = model(text)

# Save the output
sf.write("output.wav", wavs, sr)
```
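The quick start writes the model's `(wavs, sr)` output with `soundfile`; if that package is unavailable, the standard-library `wave` module can write the same 16-bit PCM file. A self-contained sketch, where a synthetic 440 Hz tone stands in for real model output and the 24 kHz sample rate is an assumption rather than a documented model property:

```python
import math
import struct
import wave

sr = 24000          # assumed sample rate; real code should use the sr the model returns
duration_s = 0.5
# Synthetic stand-in for `wavs`: a 440 Hz sine tone with samples in [-1.0, 1.0]
wavs = [math.sin(2 * math.pi * 440 * n / sr) for n in range(int(sr * duration_s))]

# Write 16-bit mono PCM using only the standard library
with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)   # 2 bytes = 16-bit samples
    f.setframerate(sr)
    f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in wavs))

# Read back and confirm the duration matches
with wave.open("output.wav", "rb") as f:
    print(f.getnframes() / f.getframerate(), "seconds")
```

The same scale-and-pack step applies to float samples from any TTS model; only the sample rate and channel count need to match the model's actual output.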
Access official documentation, models, and research materials.