Qwen3-TTS: Next-Generation Open-Source Text-to-Speech

3-Second Voice Cloning • 97ms Latency • 10 Languages

Qwen3-TTS is an advanced open-source text-to-speech model series developed by the Qwen team at Alibaba Cloud. Trained on over 5 million hours of speech data, Qwen3-TTS delivers state-of-the-art voice cloning, natural language voice design, and ultra-low latency streaming for real-time applications.

Apache 2.0 License • By Alibaba Cloud • 5M+ Hours Training Data

Core Capabilities

Qwen3-TTS combines cutting-edge architecture with practical features for production-ready speech synthesis.

🎤 3-Second Voice Cloning

Clone any voice from just 3 seconds of reference audio using zero-shot voice cloning; no fine-tuning is required. The model captures a speaker's unique timbre and prosody with high fidelity.

Zero-shot capability

Natural Language Voice Design

Create entirely new voices using natural language descriptions. Control timbre, emotion, accent, and speaking style through intuitive text prompts.

Description-based control

Ultra-Low Latency

Achieve 97ms first-packet latency with the 0.6B model, enabling real-time streaming for conversational AI and interactive applications.

97ms first packet
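To make the latency claim concrete, here is a minimal sketch of the arithmetic behind streaming TTS: the first-packet latency sets the time until the listener hears anything, and playback only stays gap-free if each chunk is generated faster than it plays back. The figures in the example (network delay, chunk timings) are illustrative assumptions, not measured Qwen3-TTS numbers; only the 97ms first-packet latency comes from the text above.

```python
# Sketch: why first-packet latency matters for streaming TTS.
# Numbers other than the 97ms first-packet figure are illustrative assumptions.

def time_to_first_audio(first_packet_ms: float, network_ms: float = 0.0) -> float:
    """Time until the listener hears sound: model latency plus transport delay."""
    return first_packet_ms + network_ms

def playback_underruns(chunk_audio_ms: float, chunk_gen_ms: float) -> bool:
    """Playback stalls if each chunk takes longer to generate than to play back."""
    return chunk_gen_ms > chunk_audio_ms

# With 97ms first-packet latency and an assumed 20ms of network delay,
# audio starts roughly 117ms after the request.
print(time_to_first_audio(97.0, 20.0))  # 117.0

# If 200ms audio chunks are generated in 150ms each (real-time factor 0.75),
# the stream never stalls.
print(playback_underruns(200.0, 150.0))  # False
```

This is why the 0.6B model targets streaming: as long as generation runs faster than real time, the first packet is the only latency the user perceives.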
🌐 Multilingual Support

Generate natural speech across 10 major languages including Chinese, English, Japanese, Korean, German, French, Spanish, and more.

10 languages
📈 Flexible Model Sizes

Choose the 1.7B model for maximum quality and expressiveness, or the 0.6B model for efficiency and real-time streaming applications.

0.6B & 1.7B variants
🔓 Open Source

Released under the Apache 2.0 license, enabling unrestricted commercial use, modification, and self-hosted deployment without licensing fees.

Apache 2.0

Technical Specifications

Built on a dual-track language model architecture paired with the Qwen-TTS-Tokenizer-12Hz speech tokenizer for low-latency, high-quality synthesis.

Architecture: Dual-Track Language Model
Speech Encoder: Qwen-TTS-Tokenizer-12Hz
Training Data: 5 Million+ Hours
Languages Supported: 10 Major Languages
Minimum Latency: 97ms (0.6B Model)
License: Apache 2.0

Model Variants

| Model | Parameters | Primary Use Case | First-Packet Latency |
| --- | --- | --- | --- |
| Qwen3-TTS-12Hz-1.7B-Base | 1.7 Billion | High-quality synthesis | 101ms |
| Qwen3-TTS-12Hz-0.6B-Base | 0.6 Billion | Low-latency streaming | 97ms |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7 Billion | Voice design & cloning | 101ms |

Performance Benchmarks

Qwen3-TTS outperforms leading commercial and open-source models in key metrics.

WER (Chinese): 0.77
WER (English): 1.24
Speaker Similarity: 0.829
Cross-Lingual Cloning: #1
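For readers unfamiliar with the metric, WER (word error rate) is the edit distance between a transcript of the synthesized audio and the input text, divided by the number of reference words. The sketch below is the generic metric only; the exact evaluation conditions behind the figures above (ASR system, test sets, and character-level scoring typically used for Chinese) are those of the model's benchmark, not shown here.

```python
# Word Error Rate (WER), the metric behind the benchmark figures above:
# word-level edit distance between reference text and the transcript of the
# generated audio, as a percentage of reference words. Lower is better.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on the mat"))  # 0.0
print(wer("the cat sat on the mat", "the cat sat on a mat"))    # one substitution in six words
```

A WER below 1, as reported for Chinese above, means fewer than one word (or character) in a hundred is transcribed incorrectly from the synthesized speech.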

Applications

Qwen3-TTS enables innovative solutions across industries with its advanced speech synthesis capabilities.

🤖 Virtual Assistants

Power conversational AI with ultra-low latency responses. With 97ms first-packet latency, streaming synthesis enables natural, real-time dialogue for voice assistants, customer service bots, and smart home devices.

🎥 Content Creation

Generate professional narration for audiobooks, podcasts, video voiceovers, and documentaries. Create unique character voices for audio dramas and animated content.

📚 Accessibility

Improve screen reader experiences with natural, engaging speech. Provide multilingual audio for educational materials and support learners with personalized voices.

🎮 Entertainment & Gaming

Generate dynamic NPC dialogue in real-time. Create interactive storytelling experiences with context-aware speech that adapts to player actions and game states.

Quick Start

Get started with Qwen3-TTS in minutes using the official Python package.

Installation & Basic Usage

# Install the qwen-tts package
pip install -U qwen-tts

# Basic text-to-speech generation
from qwen_tts import Qwen3TTSModel
import torch
import soundfile as sf

# Load the model
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-0.6B-Base",
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Generate speech
text = "Welcome to Qwen3-TTS, an advanced text-to-speech system."
wavs, sr = model(text)

# Save the output
sf.write("output.wav", wavs, sr)
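The snippet above assumes the `soundfile` package is installed. If it is not available, Python's standard-library `wave` module can write the output instead, assuming `wavs` is a 1-D sequence of floats in [-1, 1] and `sr` is the sample rate in Hz (the actual return shape and dtype of the qwen-tts API may differ; adjust accordingly).

```python
# Fallback save path using only the standard library, assuming the model
# returns a 1-D sequence of floats in [-1, 1] and an integer sample rate.
# (The exact qwen-tts return types may differ; adjust as needed.)
import math
import struct
import wave

def save_wav(path: str, samples, sample_rate: int) -> None:
    """Write float samples as 16-bit mono PCM."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)          # mono
        f.setsampwidth(2)          # 16-bit samples
        f.setframerate(sample_rate)
        clipped = (max(-1.0, min(1.0, s)) for s in samples)
        f.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in clipped))

# Example with a 100ms 440Hz sine tone standing in for model output:
sr = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * t / sr) for t in range(sr // 10)]
save_wav("output.wav", tone, sr)
```

In place of the sine tone, pass the model's `wavs` output and returned sample rate to `save_wav`.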

Resources

Access official documentation, models, and research materials.