Gemini 3.1 Flash TTS

Google · Audio Generation
POST /v1/audio/speech
Highly controllable TTS with new Audio Tags for precise style, tone, pace, and delivery across narration, assistants, and voice apps.
Notes
Most controllable Gemini TTS to date.
Limits
- Text + style prompt: 4,000 bytes each (8,000 combined)
- Max output: ~10 minutes
- Audio billing: ~25 tokens per second (~15 chars/s)
- Language is auto-detected; the language setting is a hint, not a constraint
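The limits above can be checked client-side before sending a request. The following is a minimal sketch, not part of the API: the function names are hypothetical, byte counts are assumed to be measured on UTF-8 encoded text, and the billing estimate simply applies the approximate rates quoted above (it is unclear whether inline tags count toward the character rate).

```python
MAX_FIELD_BYTES = 4_000      # per-field limit for text and for the style prompt
MAX_COMBINED_BYTES = 8_000   # combined limit for both fields
TOKENS_PER_SECOND = 25       # approximate audio billing rate
CHARS_PER_SECOND = 15        # approximate speaking rate


def check_limits(text: str, style_prompt: str = "") -> None:
    """Raise ValueError if either field, or both together, exceed the limits.

    Assumption: sizes are counted in UTF-8 bytes.
    """
    text_bytes = len(text.encode("utf-8"))
    style_bytes = len(style_prompt.encode("utf-8"))
    if text_bytes > MAX_FIELD_BYTES:
        raise ValueError(f"text is {text_bytes} bytes (max {MAX_FIELD_BYTES})")
    if style_bytes > MAX_FIELD_BYTES:
        raise ValueError(f"style prompt is {style_bytes} bytes (max {MAX_FIELD_BYTES})")
    if text_bytes + style_bytes > MAX_COMBINED_BYTES:
        raise ValueError("text and style prompt exceed the combined limit")


def estimate_audio_tokens(text: str) -> float:
    """Rough estimate: characters -> seconds of audio -> billed tokens."""
    seconds = len(text) / CHARS_PER_SECOND
    return seconds * TOKENS_PER_SECOND
```

Because the limit is stated in bytes, non-ASCII text reaches it sooner than its character count suggests: 2,001 copies of "é" already occupy 4,002 bytes in UTF-8.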
Inline audio tags (control delivery)
- Emotion: [whispers], [shouts], [laughs], [sighs], [cheerful], [sad], [angry], etc.
- Pace: [slow], [fast], [extremely fast], [normal pace]
- Pauses: [short pause], [long pause], [breath]
- Emphasis: [softly], [loudly], [high pitch], [low pitch], [rising tone], [falling tone]
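As an illustration of how the tags above might be placed inline, here is a hedged sketch that builds (but does not send) a request to POST /v1/audio/speech. Several details are assumptions rather than documented facts: the host is borrowed from the schema URL, the JSON field names `model` and `input` are hypothetical, and bearer-token authentication is a guess. Consult the machine-readable schema for the actual parameters.

```python
import json
import urllib.request

# Assumption: the speech endpoint lives on the same host as the schema URL.
API_URL = "https://api.empiriolabs.ai/v1/audio/speech"


def build_request(text: str, api_key: str) -> urllib.request.Request:
    """Build a POST request object; nothing is sent over the network here."""
    payload = {
        "model": "gemini-3-1-flash-tts",  # hypothetical field name and value
        "input": text,                    # hypothetical field name
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


# Tags are written inline, directly before the text they should affect.
script = (
    "[cheerful] Welcome back! [short pause] "
    "[whispers] Here is the secret. [long pause] "
    "[slow] Take your time."
)
req = build_request(script, "YOUR_API_KEY")
```

Passing `req` to `urllib.request.urlopen` would send it; the response format is not described in this section, so handling it is left out.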
Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/gemini-3-1-flash-tts.
