Gemini 3.1 Flash TTS

Google · Audio Generation
POST /v1/audio/speech

Highly controllable TTS with new Audio Tags for precise style, tone, pace, and delivery across narration, assistants, and voice apps.

At a glance

| Field | Value |
| --- | --- |
| Model id | gemini-3-1-flash-tts |
| Input modalities | Text |
| Output modalities | Audio |
| Context window | - |
| Weight precision | - |
| Features | - |
| Native inference | No |
| New | Yes |
| Supported endpoints | POST /v1/audio/speech |

Pricing

| Charge | Spec | Rate |
| --- | --- | --- |
| Input | per 1M prompt tokens | $2.60 |
| Output | per 1M generated tokens | $52.00 |

Example request

curl https://api.empiriolabs.ai/v1/audio/speech \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "gemini-3-1-flash-tts", "input": "Hello from EmpirioLabs."}'
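
For callers who prefer Python, the same request can be sketched with the standard library alone. The function names `build_speech_request` and `save_speech` are hypothetical helpers, not part of any SDK. This page does not describe the response format, so the sketch assumes the response body is the raw audio file; adjust `save_speech` if the API returns something else (for example JSON with encoded audio).

```python
import json
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/audio/speech"


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = {"model": "gemini-3-1-flash-tts", "input": text}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request: urllib.request.Request, path: str) -> None:
    """Send the request and write the response body to `path`.

    ASSUMPTION: the response body is the audio file itself; this page
    does not document the response format.
    """
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())
```

A typical call would read the key from the `EMPIRIOLABS_API_KEY` environment variable, build the request with `build_speech_request("Hello from EmpirioLabs.", key)`, and pass it to `save_speech(request, "speech.wav")`, since WAV is the documented default output format.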

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input | string | yes | - | Text to convert to speech. For multi-speaker mode, prefix lines with Speaker1: / Speaker2:. |
| mode | enum | no | "single" | single = one voice; multi = two-voice dialogue (uses voice, voice2, and the speaker names). Allowed: single, multi |
| language | string | no | "en-US" | BCP-47 language tag (en-US, es-ES, etc.) for pronunciation cues. |
| voice | enum | no | "Charon" | Primary voice name (e.g. Kore, Puck, Aoede). Leave blank for the default. Allowed: Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, Umbriel, Algieba, Despina, Erinome, Algenib, Rasalgethi, Laomedeia, Achernar, Alnilam, Schedar, Gacrux, Pulcherrima, Achird, Zubenelgenubi, Vindemiatrix, Sadachbia, Sadaltager, Sulafat |
| voice2 | enum | no | "Kore" | Second voice name for multi-speaker mode. Allowed: Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, Umbriel, Algieba, Despina, Erinome, Algenib, Rasalgethi, Laomedeia, Achernar, Alnilam, Schedar, Gacrux, Pulcherrima, Achird, Zubenelgenubi, Vindemiatrix, Sadachbia, Sadaltager, Sulafat |
| speaker1_name | string | no | "Speaker1" | Display name used in the input prefix for speaker 1. |
| speaker2_name | string | no | "Speaker2" | Display name used in the input prefix for speaker 2. |
| output_format | enum | no | "WAV" | Audio file format. Allowed: WAV, MP3, OGG, ALAW, MULAW |
| speed | number | no | 1.0 | Playback rate. 1.0 = natural; <1 slower, >1 faster. Range: 0.25 to 2.0 |
| volume_gain | number | no | 0 | Output gain in dB. 0 = unchanged. Range: -96 to 16 |
| sample_rate | enum | no | "24000" | Output sample rate in Hz. Allowed: 8000, 16000, 22050, 24000, 44100, 48000 |
| style_prompt | string | no | - | Natural-language style direction (e.g. “warm, conversational” or “newscaster, serious”). |
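
The multi-speaker parameters work together, so a small sketch may help. The helper `build_dialogue_payload` below is hypothetical; it assembles a request body for mode "multi" from a list of turns. It assumes one turn per line and a single space after the colon in each prefix, which this page does not spell out.

```python
def build_dialogue_payload(
    turns,
    speaker1_name="Speaker1",
    speaker2_name="Speaker2",
    voice="Charon",
    voice2="Kore",
):
    """Return a JSON-serializable body for a two-voice dialogue.

    `turns` is a list of (speaker_name, text) pairs. Each turn becomes one
    input line prefixed with the speaker's name.
    ASSUMPTION: turns are newline-separated and formatted as "Name: text".
    """
    known = {speaker1_name, speaker2_name}
    lines = []
    for speaker, text in turns:
        if speaker not in known:
            raise ValueError(f"unknown speaker: {speaker!r}")
        lines.append(f"{speaker}: {text}")
    return {
        "model": "gemini-3-1-flash-tts",
        "mode": "multi",
        "input": "\n".join(lines),
        "voice": voice,    # voice for speaker 1
        "voice2": voice2,  # voice for speaker 2
        "speaker1_name": speaker1_name,
        "speaker2_name": speaker2_name,
    }


payload = build_dialogue_payload(
    [("Ana", "Did the build pass?"), ("Ben", "It did, on the second try.")],
    speaker1_name="Ana",
    speaker2_name="Ben",
    voice="Puck",
    voice2="Leda",
)
```

Rejecting unknown speaker names early keeps a typo in a prefix from being read aloud as ordinary text.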

Notes

Most controllable Gemini TTS to date.

Limits

  • Text + style prompt: 4,000 bytes each (8,000 combined)
  • Max output: ~10 minutes
  • Audio billing: ~25 tokens per second (~15 chars/s)
  • Language is auto-detected; the language setting is a hint, not a constraint
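
The byte limits and the billing rate lend themselves to a quick pre-flight check. The sketch below uses hypothetical helpers (`check_limits`, `estimate_output_cost`) and assumes the byte limits are measured on UTF-8 encoded text, which this page does not state. The cost estimate combines the ~25 tokens per second figure with the $52.00 per 1M output tokens rate from the Pricing section, so it is approximate by construction.

```python
MAX_FIELD_BYTES = 4_000                # limit for input and for style_prompt
TOKENS_PER_SECOND = 25                 # approximate audio billing rate
OUTPUT_USD_PER_MILLION_TOKENS = 52.00  # from the Pricing section


def check_limits(text: str, style_prompt: str = "") -> dict:
    """Return the byte size of each field, raising if one exceeds the limit.

    ASSUMPTION: sizes are counted in UTF-8 bytes.
    """
    sizes = {
        "input": len(text.encode("utf-8")),
        "style_prompt": len(style_prompt.encode("utf-8")),
    }
    for name, size in sizes.items():
        if size > MAX_FIELD_BYTES:
            raise ValueError(f"{name} is {size} bytes; limit is {MAX_FIELD_BYTES}")
    return sizes


def estimate_output_cost(seconds: float) -> float:
    """Estimate the output charge in USD for `seconds` of generated audio."""
    tokens = seconds * TOKENS_PER_SECOND
    return tokens / 1_000_000 * OUTPUT_USD_PER_MILLION_TOKENS
```

At these rates, a clip at the ~10 minute maximum is 600 s × 25 tokens/s = 15,000 tokens, or about $0.78 in output charges. Note that non-ASCII text reaches the byte limit sooner than its character count suggests: "é" occupies two bytes in UTF-8.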

Inline audio tags (control delivery)

  • Emotion: [whispers], [shouts], [laughs], [sighs], [cheerful], [sad], [angry], etc.
  • Pace: [slow], [fast], [extremely fast], [normal pace]
  • Pauses: [short pause], [long pause], [breath]
  • Emphasis: [softly], [loudly], [high pitch], [low pitch], [rising tone], [falling tone]
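
As an illustration of how the tags above might be combined in a request, the sketch below joins tagged segments into one input string. The helper `with_tags` is hypothetical, and placing each tag immediately before the text it affects is an assumption; this page lists the tags but not their placement rules.

```python
def with_tags(segments):
    """Join (tag, text) pairs into one input string.

    `tag` is a tag name without brackets, or None for untagged text.
    ASSUMPTION: a tag is written directly before the text it applies to.
    """
    parts = []
    for tag, text in segments:
        piece = f"[{tag}] {text}".strip() if tag else text
        parts.append(piece)
    return " ".join(parts)


tagged_input = with_tags([
    ("whispers", "I have a secret."),
    ("long pause", ""),
    ("cheerful", "Just kidding!"),
])
# tagged_input == "[whispers] I have a secret. [long pause] [cheerful] Just kidding!"

payload = {"model": "gemini-3-1-flash-tts", "input": tagged_input}
```

The tags count toward the 4,000-byte input limit like any other characters, since they travel inside the input field.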

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/gemini-3-1-flash-tts.