How to Use Voxtral TTS
From first click to production-ready audio in under a minute. No account, no API key, no technical setup.
Quick Start
Three steps. Thirty seconds. Professional audio.
Enter your text
Type or paste up to 2,000 characters of text. Voxtral TTS automatically paces speech according to punctuation.
Choose a voice
Select a preset voice or upload a short audio clip for instant voice cloning across 9 languages.
Generate and download
Click Generate, preview in-browser, then download as MP3 or WAV for immediate use.
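For scripts longer than the 2,000-character limit, one option is to split the text locally and generate each piece in turn. The following is a minimal sketch, not part of Voxtral TTS: the function name `split_for_tts` and the sentence-boundary heuristic are assumptions made for illustration.

```python
import re

MAX_CHARS = 2000  # per-generation limit stated in the Quick Start


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters,
    preferring to break at sentence boundaries."""
    # Break after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # Adding this sentence would exceed the limit: start a new chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Breaking at sentence ends keeps each chunk's punctuation intact, which matters because punctuation drives pacing. Each chunk can then be pasted and generated separately.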
Voice Cloning Walkthrough
Clone any voice from a short audio clip. No training, no fine-tuning — the model captures identity in a single pass.
Record or Select a Voice Clip
Choose a 3 to 25 second audio sample in MP3, WAV, or WebM format. Clean speech with minimal background noise produces the best clones. Conversational recordings with natural rhythm outperform scripted studio reads.
Upload to the Voice Cloning Panel
Drag your file into the upload area or click to browse. Voxtral analyzes accent, pitch, speaking pace, and emotional texture — no fine-tuning or training loop required.
Enter Text and Synthesize
Type any text in any of the 9 supported languages. The model applies the cloned voice's characteristics to the new content while adapting pronunciation to the target language.
Iterate and Download
Preview the output, adjust punctuation or phrasing for better pacing, and regenerate if needed. Download production-ready audio in MP3 or WAV format.
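Before uploading, it can help to confirm that a reference clip falls inside the 3 to 25 second range given in the walkthrough. The sketch below uses only Python's standard `wave` module, so it handles WAV files only; MP3 and WebM clips would need a different tool. The function names are hypothetical, and `make_silent_wav` exists purely to demonstrate the check.

```python
import io
import wave

MIN_SECONDS = 3.0   # lower bound stated in the walkthrough
MAX_SECONDS = 25.0  # upper bound stated in the walkthrough


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file, given a path or a binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference(source) -> bool:
    """True if the clip length is within the accepted range."""
    return MIN_SECONDS <= wav_duration_seconds(source) <= MAX_SECONDS


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence (demonstration only)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(is_valid_reference(make_silent_wav(8)))   # 8 s clip
    print(is_valid_reference(make_silent_wav(30)))  # 30 s clip
```

In practice, `is_valid_reference("my_clip.wav")` would be called on a real recording; the path shown here is a placeholder.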
Audio Output Formats
Voxtral TTS supports multiple output formats. Choose based on your use case.
| Format | Best For | File Size | Quality |
|---|---|---|---|
| MP3 | Podcasts, web embedding, social media, email campaigns | Smaller (~1 MB/min) | High — suitable for most applications |
| WAV | Video production, post-processing, professional editing | Larger (~10 MB/min) | Lossless — maximum fidelity |
| PCM (Streaming) | Real-time voice agents, live conversational AI | Streamed (no file) | Raw; ~0.8 s time-to-first-audio |
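The per-minute figures in the table give a quick way to estimate download size. The sketch below treats them as rough constants; actual sizes depend on encoding settings that this page does not specify. The helper `pcm_mb_per_minute` is included only for comparison: it computes the size of uncompressed audio from a sample rate, channel count, and sample width, and the 44.1 kHz 16-bit stereo values in the example are an assumption, not a documented Voxtral setting.

```python
# Approximate figures taken from the format table (MB per minute of audio).
MB_PER_MINUTE = {"mp3": 1.0, "wav": 10.0}


def estimate_size_mb(fmt: str, seconds: float) -> float:
    """Rough file-size estimate in MB for a clip of the given length."""
    return MB_PER_MINUTE[fmt.lower()] * seconds / 60.0


def pcm_mb_per_minute(sample_rate: int, channels: int, bytes_per_sample: int) -> float:
    """Size of one minute of uncompressed PCM audio, in MB (10^6 bytes)."""
    return sample_rate * channels * bytes_per_sample * 60 / 1_000_000


if __name__ == "__main__":
    print(estimate_size_mb("mp3", 90))     # 90-second MP3
    print(estimate_size_mb("wav", 90))     # 90-second WAV
    print(pcm_mb_per_minute(44100, 2, 2))  # assumed CD-quality stereo
```

Under that assumption, uncompressed stereo audio comes to about 10.6 MB per minute, which is in line with the table's WAV estimate.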
Tips for Better Results
Small changes in input make big differences in output. These patterns consistently produce natural-sounding speech.
Use Punctuation for Pacing
Commas create brief pauses. Periods add longer breaks. Em dashes (—) insert dramatic hesitation. Voxtral reads punctuation as performance cues.
Match Reference Length
For voice cloning, 5–10 seconds of reference audio is the sweet spot within the accepted 3–25 second range. Shorter clips lose nuance; longer clips tend to pick up more background noise.
Conversational Tone Wins
The model reproduces natural disfluencies — "ums," pauses, breath. Conversational clips clone better than polished studio reads.
Try Cross-Lingual Cloning
Upload a French voice, generate German speech. Voxtral preserves speaker identity across all 9 supported languages.
Emotion Follows Context
Write excited text and the voice sounds excited. Write a question and intonation rises naturally. No prosody tags needed.
Preview Before Downloading
Always preview in-browser first. If a sentence sounds off, tweak the punctuation or rephrase — regeneration is instant.
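To follow the "Match Reference Length" tip, a longer recording can be trimmed before upload. This is a minimal sketch using Python's standard `wave` module, so it applies to WAV files only. The function name `trim_wav` and the choice to keep the first 10 seconds are assumptions for illustration; in practice you may prefer to pick the cleanest stretch of speech by ear.

```python
import io
import wave


def trim_wav(source, target, max_seconds: float = 10.0) -> float:
    """Copy at most `max_seconds` from the start of a WAV file.

    `source` and `target` may be paths or binary file objects.
    Returns the duration of the trimmed clip in seconds.
    """
    with wave.open(source, "rb") as src:
        channels = src.getnchannels()
        sample_width = src.getsampwidth()
        rate = src.getframerate()
        frames_to_keep = min(src.getnframes(), int(max_seconds * rate))
        frames = src.readframes(frames_to_keep)

    with wave.open(target, "wb") as dst:
        # Preserve the original audio format so only the length changes.
        dst.setnchannels(channels)
        dst.setsampwidth(sample_width)
        dst.setframerate(rate)
        dst.writeframes(frames)

    return frames_to_keep / rate
```

For example, `trim_wav("long_take.wav", "reference.wav")` would write a clip of at most 10 seconds; both file names are placeholders.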
Supported Languages
Voxtral TTS supports 9 languages with native pronunciation and cross-lingual voice cloning.
Cross-lingual cloning: clone a voice in one language, generate speech in another — speaker identity is preserved.
Ready to hear it? Generate your first audio clip now.
