Qwen3-TTS: Next-Generation Open-Source Text-to-Speech

3-Second Voice Cloning • 97ms Latency • 10 Languages

Qwen3-TTS is an advanced open-source text-to-speech model series developed by the Qwen team at Alibaba Cloud. Trained on over 5 million hours of speech data, Qwen3-TTS delivers state-of-the-art voice cloning, natural language voice design, and ultra-low latency streaming for real-time applications.

Apache 2.0 License • By Alibaba Cloud • 5M+ Hours Training Data

Core Capabilities

Qwen3-TTS combines cutting-edge architecture with practical features for production-ready speech synthesis.

🎤 3-Second Voice Cloning

Clone any voice from just 3 seconds of reference audio using zero-shot voice cloning; no fine-tuning is required. The model captures a speaker's unique timbre and prosody with high fidelity.

Zero-shot capability

Natural Language Voice Design

Create entirely new voices using natural language descriptions. Control timbre, emotion, accent, and speaking style through intuitive text prompts.

Description-based control

Ultra-Low Latency

Achieve 97ms first-packet latency with the 0.6B model, enabling real-time streaming for conversational AI and interactive applications.

97ms first packet
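To make the latency claim concrete, here is a minimal sketch of the arithmetic behind streaming TTS: the first-packet latency sets the time until the listener hears anything, and playback only stays gap-free if each chunk is generated faster than it plays back. The figures in the example (network delay, chunk timings) are illustrative assumptions, not measured Qwen3-TTS numbers; only the 97ms first-packet latency comes from the text above.

```python
# Sketch: why first-packet latency matters for streaming TTS.
# Numbers other than the 97ms first-packet figure are illustrative assumptions.

def time_to_first_audio(first_packet_ms: float, network_ms: float = 0.0) -> float:
    """Time until the listener hears sound: model latency plus transport delay."""
    return first_packet_ms + network_ms

def playback_underruns(chunk_audio_ms: float, chunk_gen_ms: float) -> bool:
    """Playback stalls if each chunk takes longer to generate than to play back."""
    return chunk_gen_ms > chunk_audio_ms

# With 97ms first-packet latency and an assumed 20ms of network delay,
# audio starts roughly 117ms after the request.
print(time_to_first_audio(97.0, 20.0))  # 117.0

# If 200ms audio chunks are generated in 150ms each (real-time factor 0.75),
# the stream never stalls.
print(playback_underruns(200.0, 150.0))  # False
```

This is why the 0.6B model targets streaming: as long as generation runs faster than real time, the first packet is the only latency the user perceives.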
🌐 Multilingual Support

Generate natural speech across 10 major languages including Chinese, English, Japanese, Korean, German, French, Spanish, and more.

10 languages
📈 Flexible Model Sizes

Choose the 1.7B model for maximum quality and expressiveness, or the 0.6B model for efficiency and real-time streaming applications.

0.6B & 1.7B variants
🔓 Open Source

Released under the Apache 2.0 license, enabling unrestricted commercial use, modification, and self-hosted deployment without licensing fees.

Apache 2.0

Technical Specifications

Built on a dual-track language model architecture paired with the Qwen-TTS-Tokenizer-12Hz speech tokenizer for low-latency, high-quality synthesis.

Architecture: Dual-Track Language Model
Speech Encoder: Qwen-TTS-Tokenizer-12Hz
Training Data: 5 Million+ Hours
Languages Supported: 10 Major Languages
Minimum Latency: 97ms (0.6B Model)
License: Apache 2.0

Model Variants

| Model | Parameters | Primary Use Case | First-Packet Latency |
| --- | --- | --- | --- |
| Qwen3-TTS-12Hz-1.7B-Base | 1.7 Billion | High-quality synthesis | 101ms |
| Qwen3-TTS-12Hz-0.6B-Base | 0.6 Billion | Low-latency streaming | 97ms |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7 Billion | Voice design & cloning | 101ms |

Performance Benchmarks

Qwen3-TTS outperforms leading commercial and open-source models in key metrics.

WER (Chinese): 0.77
WER (English): 1.24
Speaker Similarity: 0.829
Cross-Lingual Cloning: #1
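For readers unfamiliar with the metric, WER (word error rate) is the edit distance between a transcript of the synthesized audio and the input text, divided by the number of reference words. The sketch below is the generic metric only; the exact evaluation conditions behind the figures above (ASR system, test sets, and character-level scoring typically used for Chinese) are those of the model's benchmark, not shown here.

```python
# Word Error Rate (WER), the metric behind the benchmark figures above:
# word-level edit distance between reference text and the transcript of the
# generated audio, as a percentage of reference words. Lower is better.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on the mat"))  # 0.0
print(wer("the cat sat on the mat", "the cat sat on a mat"))    # one substitution in six words
```

A WER below 1, as reported for Chinese above, means fewer than one word (or character) in a hundred is transcribed incorrectly from the synthesized speech.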

Applications

Qwen3-TTS enables innovative solutions across industries with its advanced speech synthesis capabilities.

🤖 Virtual Assistants

Power conversational AI with ultra-low latency responses. With 97ms first-packet latency, streaming synthesis enables natural, real-time dialogue for voice assistants, customer service bots, and smart home devices.

🎥 Content Creation

Generate professional narration for audiobooks, podcasts, video voiceovers, and documentaries. Create unique character voices for audio dramas and animated content.

📚 Accessibility

Improve screen reader experiences with natural, engaging speech. Provide multilingual audio for educational materials and support learners with personalized voices.

🎮 Entertainment & Gaming

Generate dynamic NPC dialogue in real-time. Create interactive storytelling experiences with context-aware speech that adapts to player actions and game states.

Quick Start

Get started with Qwen3-TTS in minutes using the official Python package.

Installation & Basic Usage

# Install the qwen-tts package
pip install -U qwen-tts

# Basic text-to-speech generation
from qwen_tts import Qwen3TTSModel
import torch
import soundfile as sf

# Load the model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Generate speech
text = "Welcome to Qwen3-TTS, an advanced text-to-speech system."
wavs, sr = model(text)

# Save the output
sf.write("output.wav", wavs, sr)
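The snippet above assumes the `soundfile` package is installed. If it is not available, Python's standard-library `wave` module can write the output instead, assuming `wavs` is a 1-D sequence of floats in [-1, 1] and `sr` is the sample rate in Hz (the actual return shape and dtype of the qwen-tts API may differ; adjust accordingly).

```python
# Fallback save path using only the standard library, assuming the model
# returns a 1-D sequence of floats in [-1, 1] and an integer sample rate.
# (The exact qwen-tts return types may differ; adjust as needed.)
import math
import struct
import wave

def save_wav(path: str, samples, sample_rate: int) -> None:
    """Write float samples as 16-bit mono PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 16-bit samples
        f.setframerate(sample_rate)
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))

# Example with a 100ms 440Hz sine tone standing in for model output:
sr = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 10)]
save_wav("output.wav", tone, sr)
```

In place of the sine tone, pass the model's `wavs` output and returned sample rate to `save_wav`.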

Resources

Access official documentation, models, and research materials.