Gemini 3.1 Flash TTS | EmpirioLabs AI Docs

Google · Audio Generation

POST /v1/audio/speech

Highly controllable TTS with new Audio Tags for precise style, tone, pace, and delivery across narration, assistants, and voice apps.

At a glance

Field	Value
Model id	`gemini-3-1-flash-tts`
Model release date	2026-04-13
Input modalities	Text
Output modalities	Audio
Context window	-
Weight precision	-
Features	text_to_speech, multi_speaker, multilingual
Native inference	No
New	No
Supported endpoints	`POST /v1/audio/speech`
Alternate model ids	`gemini-3.1-flash-tts`, `google/gemini-3.1-flash-tts`

Pricing

Charge	Spec	Rate
Input	per 1M prompt tokens	$2.60
Output	per 1M generated tokens	$52.00
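As a rough worked example, the output rate can be combined with the approximate figure of ~25 tokens per second of audio listed under Limits. The sketch below is an estimate only: the function and constant names are illustrative, and input (prompt) tokens are left out because this page does not state how prompt text maps to tokens.

```python
# Rate from the Pricing table; the tokens-per-second figure is the
# approximate value listed under Limits, so results are estimates only.
OUTPUT_USD_PER_MILLION_TOKENS = 52.00
AUDIO_TOKENS_PER_SECOND = 25  # approximate


def estimate_output_cost_usd(audio_seconds: float) -> float:
    """Approximate output charge for a clip of the given duration."""
    tokens = audio_seconds * AUDIO_TOKENS_PER_SECOND
    return tokens * OUTPUT_USD_PER_MILLION_TOKENS / 1_000_000


# One minute of audio: 60 s * 25 tokens/s = 1,500 tokens, about $0.078.
one_minute = estimate_output_cost_usd(60)
# The ~10 minute maximum: 600 s * 25 tokens/s = 15,000 tokens, about $0.78.
ten_minutes = estimate_output_cost_usd(600)
```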

Example request

$ curl https://api.empiriolabs.ai/v1/audio/speech \
>   -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
>   -H 'Content-Type: application/json' \
>   -d '{"model": "gemini-3-1-flash-tts", "input": "Hello from EmpirioLabs."}'
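The same request can be sketched in Python using only the standard library. The helper names below are illustrative, and the sketch assumes the response body is the raw audio file, which this page does not state; check the actual response format before relying on it.

```python
import json
import os
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/audio/speech"


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"model": "gemini-3-1-flash-tts", "input": text})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text: str, path: str) -> None:
    """Send the request and save the response body to path.

    Assumption: the response body is the raw audio file.
    """
    request = build_speech_request(text, os.environ["EMPIRIOLABS_API_KEY"])
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

Under those assumptions, `synthesize_to_file("Hello from EmpirioLabs.", "speech.wav")` would send the request shown above and write the result to `speech.wav`.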

Parameters

Parameter	Type	Required	Default	Description
`input`	string	yes	-	Text to convert to speech. In multi-speaker mode, prefix each line with the speaker's name (Speaker1: / Speaker2: by default).
`mode`	enum	no	`"single"`	single = one voice, multi = two-voice dialogue (uses voice + voice2 + speaker names). · Allowed: `single`, `multi`
`language`	string	no	`"en-US"`	BCP-47 language tag (en-US, es-ES, etc.) for pronunciation cues.
`voice`	enum	no	`"Charon"`	Primary voice name (e.g. Kore, Puck, Aoede). Leave blank for the default. · Allowed: `Zephyr`, `Puck`, `Charon`, `Kore`, `Fenrir`, `Leda`, `Orus`, `Aoede`, `Callirrhoe`, `Autonoe`, `Enceladus`, `Iapetus`, `Umbriel`, `Algieba`, `Despina`, `Erinome`, `Algenib`, `Rasalgethi`, `Laomedeia`, `Achernar`, `Alnilam`, `Schedar`, `Gacrux`, `Pulcherrima`, `Achird`, `Zubenelgenubi`, `Vindemiatrix`, `Sadachbia`, `Sadaltager`, `Sulafat`
`voice2`	enum	no	`"Kore"`	Second voice name for multi-speaker mode. · Allowed: `Zephyr`, `Puck`, `Charon`, `Kore`, `Fenrir`, `Leda`, `Orus`, `Aoede`, `Callirrhoe`, `Autonoe`, `Enceladus`, `Iapetus`, `Umbriel`, `Algieba`, `Despina`, `Erinome`, `Algenib`, `Rasalgethi`, `Laomedeia`, `Achernar`, `Alnilam`, `Schedar`, `Gacrux`, `Pulcherrima`, `Achird`, `Zubenelgenubi`, `Vindemiatrix`, `Sadachbia`, `Sadaltager`, `Sulafat`
`speaker1_name`	string	no	`"Speaker1"`	Name used as the line prefix in `input` for speaker 1.
`speaker2_name`	string	no	`"Speaker2"`	Name used as the line prefix in `input` for speaker 2.
`output_format`	enum	no	`"WAV"`	Audio file format of the output. · Allowed: `WAV`, `MP3`, `OGG`, `ALAW`, `MULAW`
`speed`	number	no	`1.0`	Playback rate. 1.0 = natural; <1 slower, >1 faster. · Range: 0.25 – 2.0
`volume_gain`	number	no	`0`	Output gain in dB. 0 = unchanged. · Range: -96 – 16
`sample_rate`	enum	no	`"24000"`	Output sample rate in Hz. · Allowed: `8000`, `16000`, `22050`, `24000`, `44100`, `48000`
`style_prompt`	string	no	-	Natural-language style direction (e.g. “warm, conversational” or “newscaster, serious”).
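To illustrate how the multi-speaker parameters fit together, here is a minimal sketch that builds a request body for a two-voice dialogue. The helper name is hypothetical, and two details are assumptions rather than documented behavior: that each turn goes on its own line in the form `Name: text`, and that `voice` is used for speaker 1 and `voice2` for speaker 2.

```python
def build_dialogue_payload(turns, voice="Charon", voice2="Kore",
                           speaker1_name="Speaker1", speaker2_name="Speaker2"):
    """Build a multi-speaker request body from (speaker_index, text) turns.

    speaker_index is 1 or 2. Each turn becomes one line prefixed with the
    matching speaker name and a colon (assumed format).
    """
    names = {1: speaker1_name, 2: speaker2_name}
    lines = [f"{names[index]}: {text}" for index, text in turns]
    return {
        "model": "gemini-3-1-flash-tts",
        "mode": "multi",
        "input": "\n".join(lines),
        "voice": voice,
        "voice2": voice2,
        "speaker1_name": speaker1_name,
        "speaker2_name": speaker2_name,
    }


payload = build_dialogue_payload(
    [(1, "Did the build pass?"), (2, "Yes, all green.")],
    voice="Puck", voice2="Kore",
    speaker1_name="Ana", speaker2_name="Ben",
)
# payload["input"] is "Ana: Did the build pass?\nBen: Yes, all green."
```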

Notes

The most controllable Gemini TTS model to date, with inline audio tags for style, tone, pace, and delivery.

Limits

`input` and `style_prompt`: 4,000 bytes each (8,000 bytes combined)
Max output: ~10 minutes of audio
Audio billing: ~25 tokens per second of audio (speech runs at ~15 characters per second)
Language is auto-detected; the `language` parameter is a hint, not a constraint
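Because the limits are given in bytes rather than characters, non-ASCII text reaches them sooner than its character count suggests. The following client-side check is a minimal sketch: the function names are illustrative, and it assumes the byte limits apply to the UTF-8 encoding of each field, which this page does not specify.

```python
MAX_FIELD_BYTES = 4_000  # per-field limit listed above


def check_limits(text: str, style_prompt: str = "") -> list[str]:
    """Return a list of limit violations; an empty list means both fit.

    Assumption: the byte limits apply to the UTF-8 encoding of each field.
    """
    problems = []
    for name, value in (("input", text), ("style_prompt", style_prompt)):
        size = len(value.encode("utf-8"))
        if size > MAX_FIELD_BYTES:
            problems.append(f"{name} is {size} bytes (limit {MAX_FIELD_BYTES})")
    return problems


def rough_duration_seconds(text: str, chars_per_second: float = 15.0) -> float:
    """Very rough clip length based on the ~15 chars/s figure above."""
    return len(text) / chars_per_second
```

For example, 4,000 ASCII characters fit within the limit, while 2,001 copies of "é" (two bytes each in UTF-8) do not.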

Inline audio tags (control delivery)

Emotion: [whispers], [shouts], [laughs], [sighs], [cheerful], [sad], [angry], etc.
Pace: [slow], [fast], [extremely fast], [normal pace]
Pauses: [short pause], [long pause], [breath]
Emphasis: [softly], [loudly], [high pitch], [low pitch], [rising tone], [falling tone]
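As an illustration, tags are written inline in the `input` text. The sketch below shows one tagged line and a small helper that flags bracketed tags not listed on this page, which can catch typos. The helper is hypothetical, placing a tag just before the text it affects is an assumption, and because the list above ends with "etc.", an unlisted tag is not necessarily invalid.

```python
import re

# Tags listed in this section. The list ends with "etc.", so other tags may
# also be accepted; this set is only what the page documents.
DOCUMENTED_TAGS = {
    "whispers", "shouts", "laughs", "sighs", "cheerful", "sad", "angry",
    "slow", "fast", "extremely fast", "normal pace",
    "short pause", "long pause", "breath",
    "softly", "loudly", "high pitch", "low pitch",
    "rising tone", "falling tone",
}

# Matches text enclosed in a single pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def undocumented_tags(text: str) -> list[str]:
    """Return bracketed tags in text that are not in DOCUMENTED_TAGS."""
    found = TAG_PATTERN.findall(text)
    return [tag for tag in found if tag.strip().lower() not in DOCUMENTED_TAGS]


line = "[whispers] Don't wake the baby. [long pause] [cheerful] Okay, she's up!"
unknown = undocumented_tags(line)                  # every tag here is listed
typos = undocumented_tags("[wispers] Good night.")  # flags the misspelled tag
```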

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/gemini-3-1-flash-tts.