How to Use Voxtral TTS

From first click to production-ready audio in under a minute. No account, no API key, no technical setup.

Quick Start

Three steps. Thirty seconds. Professional audio.

Enter your text

Type or paste up to 2,000 characters of text. Voxtral TTS automatically paces speech according to your punctuation.

Choose a voice

Select a preset voice or upload a short audio clip for instant voice cloning across 9 languages.

Generate and download

Click Generate, preview in-browser, then download as MP3 or WAV for immediate use.
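Scripts longer than the 2,000-character limit in step one need to be split before pasting. The sketch below is one possible way to do that in Python, breaking after sentence-ending punctuation so each chunk keeps its pacing cues. The function name `split_for_tts` and the sentence-splitting rule are illustrative assumptions, not part of Voxtral TTS.

```python
import re

MAX_CHARS = 2000  # input limit stated in the Quick Start steps


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking after sentences."""
    # Split after ., !, or ? followed by whitespace, so punctuation stays attached.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split as a last resort.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can be pasted and generated separately, and the resulting audio files joined in an editor afterwards.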

Voice Cloning Walkthrough

Clone any voice from a short audio clip. No training, no fine-tuning — the model captures identity in a single pass.

Prepare

Record or Select a Voice Clip

Choose a 3–25 second audio sample in MP3, WAV, or WEBM. Clean speech with minimal background noise produces the best clones. Conversational recordings with natural rhythm outperform scripted studio reads.

Upload

Upload to the Voice Cloning Panel

Drag your file into the upload area or click to browse. Voxtral analyzes accent, pitch, speaking pace, and emotional texture — no fine-tuning or training loop required.

Generate

Enter Text and Synthesize

Type any text in any of the 9 supported languages. The model applies the cloned voice's characteristics to the new content while adapting pronunciation to the target language.

Refine

Iterate and Download

Preview the output, adjust punctuation or phrasing for better pacing, and regenerate if needed. Download production-ready audio in MP3 or WAV format.
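Before uploading, it can help to check a reference clip against the requirements in the Prepare step (3–25 seconds; MP3, WAV, or WEBM). The following is a minimal local pre-check using only the Python standard library. The helper names are hypothetical, and only WAV duration is measured, because the standard library cannot decode MP3 or WEBM.

```python
import wave
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".webm"}  # formats listed in the Prepare step
MIN_SECONDS, MAX_SECONDS = 3.0, 25.0            # accepted clip length from the Prepare step


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file using only the standard library."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path: str) -> list[str]:
    """Return a list of problems found; an empty list means no problem was detected."""
    problems = []
    suffix = Path(path).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {suffix or 'no extension'}")
    elif suffix == ".wav":
        duration = wav_duration_seconds(path)
        if not MIN_SECONDS <= duration <= MAX_SECONDS:
            problems.append(f"duration {duration:.1f}s is outside 3-25 seconds")
    # MP3 and WEBM durations are not checked here: the standard library cannot decode them.
    return problems
```

An empty result only means the file extension and, for WAV, the length are within the stated limits. It says nothing about background noise or speech quality, which still need a listen.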

Audio Output Formats

Voxtral TTS supports multiple output formats. Choose based on your use case.

Format	Best For	File Size	Quality
MP3	Podcasts, web embedding, social media, email campaigns	Smaller (~1 MB/min)	High — suitable for most applications
WAV	Video production, post-processing, professional editing	Larger (~10 MB/min)	Lossless — maximum fidelity
PCM (Streaming)	Real-time voice agents, live conversational AI	Streamed, no file	Raw, uncompressed samples; ~0.8 s time-to-first-audio
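The approximate file sizes in the table are consistent with common encoding settings. As a rough check, the sketch below computes size per minute for uncompressed WAV and constant-bitrate MP3. The defaults (44.1 kHz, 16-bit, stereo WAV; 128 kbps MP3) are assumptions chosen for illustration, since the actual encoding settings Voxtral TTS uses are not stated here.

```python
def wav_mb_per_minute(sample_rate_hz: int = 44_100, channels: int = 2,
                      bits_per_sample: int = 16) -> float:
    """Uncompressed PCM size per minute, in megabytes (10**6 bytes)."""
    bytes_per_second = sample_rate_hz * channels * bits_per_sample // 8
    return bytes_per_second * 60 / 1_000_000


def mp3_mb_per_minute(bitrate_kbps: int = 128) -> float:
    """Constant-bitrate MP3 size per minute, in megabytes (10**6 bytes)."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return bytes_per_second * 60 / 1_000_000


print(f"WAV: {wav_mb_per_minute():.2f} MB/min")  # assumed 44.1 kHz, 16-bit, stereo
print(f"MP3: {mp3_mb_per_minute():.2f} MB/min")  # assumed 128 kbps
```

Under these assumptions WAV comes to about 10.6 MB per minute and MP3 to about 0.96 MB per minute, roughly a tenfold difference, which matches the table's guidance to pick MP3 for distribution and WAV for editing.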

Tips for Better Results

Small changes in input make big differences in output. These patterns consistently produce natural-sounding speech.

Use Punctuation for Pacing

Commas create brief pauses. Periods add longer breaks. Em dashes (—) insert dramatic hesitation. Voxtral reads punctuation as performance cues.

Match Reference Length

For voice cloning, 5–10 seconds of reference audio is the sweet spot within the accepted 3–25 second range. Shorter clips lose nuance; longer clips tend to add noise.

Conversational Tone Wins

The model reproduces natural disfluencies — "ums," pauses, breath. Conversational clips clone better than polished studio reads.

Try Cross-Lingual Cloning

Upload a French voice, generate German speech. Voxtral preserves speaker identity across all 9 supported languages.

Emotion Follows Context

Write excited text and the voice sounds excited. Write a question and intonation rises naturally. No prosody tags needed.

Preview Before Downloading

Always preview in-browser first. If a sentence sounds off, tweak the punctuation or rephrase — regeneration is instant.
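Since punctuation acts as a pacing cue, long sentences with no commas or dashes are likely candidates for a tweak before regenerating. The helper below is a rough heuristic sketch that flags them. The function name and the 25-word threshold are arbitrary assumptions, not values taken from Voxtral TTS.

```python
import re

PAUSE_MARKS = ",;:\u2014"  # comma, semicolon, colon, em dash


def sentences_needing_pauses(text: str, max_words: int = 25) -> list[str]:
    """Return sentences longer than `max_words` that contain no internal pause mark."""
    # Split after ., !, or ? followed by whitespace, so each sentence keeps its ending.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    flagged = []
    for sentence in sentences:
        has_pause = any(mark in sentence for mark in PAUSE_MARKS)
        if len(sentence.split()) > max_words and not has_pause:
            flagged.append(sentence)
    return flagged
```

Flagged sentences are only suggestions: adding a comma, or splitting the sentence in two, is worth trying if the preview sounds rushed.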

Supported Languages

Voxtral TTS supports 9 languages with native pronunciation and cross-lingual voice cloning.

English, French, Spanish, Portuguese, Italian, Dutch, German, Hindi, Arabic

Cross-lingual cloning: clone a voice in one language, generate speech in another — speaker identity is preserved.
