Gemini 2.5 Flash TTS

Google · Audio Generation
POST /v1/audio/speech

Low-latency text-to-speech with single- and multi-speaker voices and controllable style, accent, and expressive tone for production apps.

At a glance

| Field | Value |
| --- | --- |
| Model id | gemini-2-5-flash-tts |
| Input modalities | Text |
| Output modalities | Audio |
| Context window | - |
| Weight precision | - |
| Features | - |
| Native inference | No |
| New | No |
| Supported endpoints | POST /v1/audio/speech |

Pricing

| Charge | Spec | Rate |
| --- | --- | --- |
| Input | per 1M prompt tokens | $1.50 |
| Output | per 1M generated tokens | $30.00 |

Example request

curl https://api.empiriolabs.ai/v1/audio/speech \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "gemini-2-5-flash-tts", "input": "Hello from EmpirioLabs."}'
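The same request can be built from Python. This is a minimal sketch using only the standard library. The helper names `build_speech_request` and `synthesize_to_file` are hypothetical, and writing the response body straight to a file assumes the endpoint returns raw audio bytes, which this page does not confirm.

```python
import json
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/audio/speech"


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {"model": "gemini-2-5-flash-tts", "input": text}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text: str, api_key: str, path: str) -> None:
    """Send the request and save the response body.

    Assumption: the response body is the audio itself (WAV by default).
    """
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        with open(path, "wb") as audio_file:
            audio_file.write(response.read())
```

A call such as `synthesize_to_file("Hello from EmpirioLabs.", os.environ["EMPIRIOLABS_API_KEY"], "speech.wav")` would then perform the network request (this needs `import os`).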

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input | string | yes | - | Text to convert to speech. In multi-speaker mode, prefix each line with the speaker name (Speaker1: or Speaker2: by default). |
| mode | enum | no | "single" | single = one voice; multi = two-voice dialogue (uses voice, voice2, and the speaker names). Allowed: single, multi |
| language | string | no | "en-US" | BCP-47 language tag (en-US, es-ES, etc.) for pronunciation cues. |
| voice | enum | no | "Charon" | Primary voice name (e.g. Kore, Puck, Aoede). Leave blank for the default. Allowed: Zephyr, Puck, Charon, Kore, Fenrir, Leda, Orus, Aoede, Callirrhoe, Autonoe, Enceladus, Iapetus, Umbriel, Algieba, Despina, Erinome, Algenib, Rasalgethi, Laomedeia, Achernar, Alnilam, Schedar, Gacrux, Pulcherrima, Achird, Zubenelgenubi, Vindemiatrix, Sadachbia, Sadaltager, Sulafat |
| voice2 | enum | no | "Kore" | Second voice name for multi-speaker mode. Allowed: same values as voice. |
| speaker1_name | string | no | "Speaker1" | Display name used in the input prefix for speaker 1. |
| speaker2_name | string | no | "Speaker2" | Display name used in the input prefix for speaker 2. |
| output_format | enum | no | "WAV" | Audio file format. Allowed: WAV, MP3, OGG, ALAW, MULAW |
| speed | number | no | 1.0 | Playback rate. 1.0 = natural; <1 slower, >1 faster. Range: 0.25 to 2.0 |
| volume_gain | number | no | 0 | Output gain in dB. 0 = unchanged. Range: -96 to 16 |
| sample_rate | enum | no | "24000" | Output sample rate in Hz. Allowed: 8000, 16000, 22050, 24000, 44100, 48000 |
| style_prompt | string | no | - | Natural-language style direction (e.g. “warm, conversational” or “newscaster, serious”). |
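The ranges and allowed values in the table can be checked on the client before a request is sent. The following is a sketch, not part of the API: `validate_speech_params` is a hypothetical helper, the exact-case comparison for `output_format` is an assumption, and so is accepting `sample_rate` as either a string or a number.

```python
ALLOWED_MODES = {"single", "multi"}
ALLOWED_FORMATS = {"WAV", "MP3", "OGG", "ALAW", "MULAW"}
ALLOWED_SAMPLE_RATES = {"8000", "16000", "22050", "24000", "44100", "48000"}


def validate_speech_params(params: dict) -> list:
    """Return a list of problems; an empty list means none were detected."""
    problems = []
    if not params.get("input"):
        problems.append("input is required")
    if params.get("mode", "single") not in ALLOWED_MODES:
        problems.append("mode must be single or multi")
    if not 0.25 <= params.get("speed", 1.0) <= 2.0:
        problems.append("speed must be between 0.25 and 2.0")
    if not -96 <= params.get("volume_gain", 0) <= 16:
        problems.append("volume_gain must be between -96 and 16")
    # Assumption: format names are matched exactly as listed (upper case).
    if params.get("output_format", "WAV") not in ALLOWED_FORMATS:
        problems.append("output_format is not an allowed value")
    # Assumption: sample_rate may arrive as a string or a number.
    if str(params.get("sample_rate", "24000")) not in ALLOWED_SAMPLE_RATES:
        problems.append("sample_rate is not an allowed value")
    return problems
```

For example, `validate_speech_params({"input": "Hi", "speed": 3.0})` reports the out-of-range speed, while a payload containing only `input` passes because every other parameter falls back to its documented default.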

Notes

Modes

  • Single speaker
  • Multi-speaker (max 2 voices): each line of input must use the SpeakerName: text format
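A multi-speaker request body could be assembled as sketched below. `build_dialogue_payload` is a hypothetical helper. Two details are assumptions not stated on this page: that turns are separated by newlines, and that `voice` belongs to speaker 1 while `voice2` belongs to speaker 2.

```python
def build_dialogue_payload(
    turns,
    speaker1_name="Speaker1",
    speaker2_name="Speaker2",
    voice="Charon",
    voice2="Kore",
):
    """Build a multi-speaker payload from (speaker_name, text) pairs."""
    allowed_speakers = {speaker1_name, speaker2_name}
    lines = []
    for speaker, text in turns:
        if speaker not in allowed_speakers:
            raise ValueError(f"unknown speaker: {speaker!r}")
        # Each line carries the "SpeakerName: text" prefix described above.
        lines.append(f"{speaker}: {text}")
    return {
        "model": "gemini-2-5-flash-tts",
        "mode": "multi",
        # Assumption: one turn per line, joined with newlines.
        "input": "\n".join(lines),
        "voice": voice,
        "voice2": voice2,
        "speaker1_name": speaker1_name,
        "speaker2_name": speaker2_name,
    }
```

Rejecting unknown speaker names early keeps the two-voice limit visible in client code, so a bad prefix is caught before the request is sent.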

Limits

  • Input text and style prompt: 4,000 bytes each
  • Audio billing: ~32 tokens per second of generated audio (speech runs at roughly 10-15 characters per second)
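These limits lend themselves to a quick pre-flight check and cost estimate. The sketch below rests on three assumptions: the byte limit is measured on UTF-8 encoded text, generated audio tokens are billed at the Output rate from the Pricing table, and the 10-15 characters per second figure is only a rough guide to duration. All function names are hypothetical.

```python
MAX_BYTES = 4000
AUDIO_TOKENS_PER_SECOND = 32           # approximate, per the note above
OUTPUT_USD_PER_MILLION_TOKENS = 30.00  # Output rate from the Pricing table


def within_byte_limit(text: str) -> bool:
    """Check the 4,000-byte limit (assumption: measured in UTF-8)."""
    return len(text.encode("utf-8")) <= MAX_BYTES


def estimate_audio_cost(seconds: float) -> float:
    """Estimate the output charge in USD for a given audio duration."""
    tokens = seconds * AUDIO_TOKENS_PER_SECOND
    return tokens * OUTPUT_USD_PER_MILLION_TOKENS / 1_000_000


def estimate_duration_range(text: str) -> tuple:
    """Return (shortest, longest) duration in seconds at 15 and 10 chars/s."""
    chars = len(text)
    return chars / 15, chars / 10
```

Under these assumptions, one minute of audio is about 1,920 output tokens, or roughly $0.058, before any input-token charge. Non-ASCII text reaches the byte limit sooner than its character count suggests, since such characters take more than one byte each in UTF-8.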

Voices and languages

  • 30 voices covering a range of emotional and tonal characters (see the voice parameter for the full list)
  • 24+ language locales supported

Output formats

  • WAV (default), MP3, OGG, ALAW, MULAW

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/gemini-2-5-flash-tts.