TTS 1.5 Mini

Inworld · Audio Generation
POST /v1/audio/speech

Voice synthesis with sub-130 ms time to first byte (TTFB), 271+ voices across 15 languages, expressive prosody, and real-time SSE streaming for low-latency voice agents.

At a glance

| Field | Value |
| --- | --- |
| Model id | tts-1-5-mini |
| Input modalities | Text |
| Output modalities | Audio |
| Context window | |
| Weight precision | - |
| Features | multi_speaker, real_time, low_latency, streaming, word_timestamps, character_timestamps, multilingual, expressive_prosody |
| Native inference | No |
| New | Yes |
| Supported endpoints | POST /v1/audio/speech, POST /v1/audio/speech:stream, GET /v1/voices |

Pricing

| Charge | Spec | Rate |
| --- | --- | --- |
| Synthesis | per 1M characters | $17.50 (was $25.00) |
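
As a rough illustration of the per-character rate, the sketch below estimates the cost of a piece of text. It assumes billing counts every character that Python's `len()` counts, including whitespace and punctuation; the page does not say how characters are counted, so treat the result as an estimate. The function name `synthesis_cost_usd` is hypothetical.

```python
from decimal import Decimal

# USD per 1,000,000 characters, taken from the pricing table.
RATE_PER_MILLION_CHARS = Decimal("17.50")


def synthesis_cost_usd(text: str) -> Decimal:
    """Estimate the synthesis cost of `text` in USD.

    Assumption: one billed character per Python character (code point).
    """
    return Decimal(len(text)) * RATE_PER_MILLION_CHARS / Decimal(1_000_000)


if __name__ == "__main__":
    sample = "Hello from EmpirioLabs."
    print(len(sample), "characters cost about", synthesis_cost_usd(sample), "USD")
```

`Decimal` is used instead of `float` so that small per-request amounts add up without binary rounding drift.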

Example request

curl https://api.empiriolabs.ai/v1/audio/speech \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tts-1-5-mini", "input": "Hello from EmpirioLabs."}'
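
The same request can be sketched in Python with only the standard library. The helper names `build_speech_request` and `synthesize_to_file` are hypothetical, and the sketch assumes the response body is the audio itself; if the endpoint returns a JSON envelope instead (for example when timestamps are requested), the body would need to be decoded rather than written straight to disk.

```python
import json
import os
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/audio/speech"


def build_speech_request(text: str, api_key: str,
                         model: str = "tts-1-5-mini") -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(text: str, path: str) -> None:
    """Send the request and write the raw response bytes to `path`.

    Assumption: the response body is the audio payload (WAV by default).
    """
    request = build_speech_request(text, os.environ["EMPIRIOLABS_API_KEY"])
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(path, "wb") as handle:
        handle.write(audio)
```

Calling `synthesize_to_file("Hello from EmpirioLabs.", "hello.wav")` would perform the network call; `build_speech_request` alone is side-effect free and convenient for inspecting headers and body.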

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input | string | yes | | Text to synthesize. Max 2,000 characters per request; chunk longer copy at sentence boundaries on the client. · Max: 2000 |
| voice | enum | no | "Sarah" | Voice preset. 20 hand-picked voices covering English, Spanish, Portuguese, and Hindi, plus various accents. For the full 271-voice catalog (including cloned voices), use voice_id instead. · Allowed: Sarah, Olivia, Elizabeth, Ashley, Wendy, Julia, Priya, Pixie, Deborah, Alex, Mark, Edward, Theodore, Ronald, Dennis, Timothy, Shaun, Craig, Hades, Heitor |
| voice_id | string | no | | Free-form voice ID. Overrides voice when set. Use this to address voices outside the curated 20-preset list: Inworld TTS 1.5 ships 271+ named voices across 15 languages (regional accents, gendered variants). Example: Maitê, Olivia, or any voice name from GET /v1/voices. |
| language | enum | no | "en-US" | BCP-47 language code. Inworld TTS 1.5 covers 15 languages, exposed here as 18 regional locales. · Allowed: en-US, en-GB, es-ES, es-MX, fr-FR, de-DE, it-IT, pt-BR, pt-PT, nl-NL, pl-PL, ru-RU, ja-JP, ko-KR, zh-CN, hi-IN, ar-EG, he-IL |
| output_format | enum | no | "WAV" | Audio container/codec. WAV = LINEAR16 inside RIFF (ubiquitous). MP3 / OGG = compressed. PCM = headerless raw, useful for chunked real-time playback. FLAC = lossless. · Allowed: MP3, WAV, OGG, FLAC, PCM, ALAW, MULAW |
| sample_rate | enum | no | "24000" | Output sample rate in Hz. 24000 is Inworld’s default and the rate its voice models are trained at; raise to 48000 for broadcast quality. · Allowed: 8000, 16000, 22050, 24000, 32000, 44100, 48000 |
| speed | number | no | 1.0 | Speaking rate multiplier. 0.5 = half speed, 1.5 = 50% faster. · Range: 0.5 – 1.5 |
| temperature | number | no | 1.0 | Voice expressiveness / variability. Lower = more consistent / “flat”; higher = more expressive but more variation between renders. · Range: 0.1 – 2.0 |
| bit_rate | number | no | 128000 | Bitrate in bps for MP3 and OGG (Opus) output. Ignored for other encodings. · Range: 32000 – 320000 |
| apply_text_normalization | enum | no | "ON" | When ON, Inworld expands numbers / abbreviations / dates into spoken form (“USD 5” → “five US dollars”). · Allowed: ON, OFF |
| timestamp_type | enum | no | "NONE" | If set to WORD or CHARACTER, the response includes per-word or per-character timestamps in timestamp_info. Useful for caption / highlight UIs. · Allowed: NONE, WORD, CHARACTER |
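
The documented ranges and allowed values can be checked on the client before a request is sent. The sketch below is one way to do that; `build_payload` is a hypothetical helper, it covers only a subset of the parameters, and it sends `sample_rate` as a string because the table lists it as an enum with a quoted default, which is an assumption about the wire format.

```python
from typing import Optional

MODEL_ID = "tts-1-5-mini"
MAX_INPUT_CHARS = 2000
OUTPUT_FORMATS = {"MP3", "WAV", "OGG", "FLAC", "PCM", "ALAW", "MULAW"}
SAMPLE_RATES = {8000, 16000, 22050, 24000, 32000, 44100, 48000}
TIMESTAMP_TYPES = {"NONE", "WORD", "CHARACTER"}


def build_payload(
    text: str,
    *,
    voice: str = "Sarah",
    voice_id: Optional[str] = None,
    output_format: str = "WAV",
    sample_rate: int = 24000,
    speed: float = 1.0,
    temperature: float = 1.0,
    timestamp_type: str = "NONE",
) -> dict:
    """Validate arguments against the documented limits and return a JSON-ready dict."""
    if not text or len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input must be 1 to {MAX_INPUT_CHARS} characters")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {output_format}")
    if sample_rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if not 0.5 <= speed <= 1.5:
        raise ValueError("speed must be between 0.5 and 1.5")
    if not 0.1 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0.1 and 2.0")
    if timestamp_type not in TIMESTAMP_TYPES:
        raise ValueError(f"unsupported timestamp_type: {timestamp_type}")

    payload = {
        "model": MODEL_ID,
        "input": text,
        "output_format": output_format,
        "sample_rate": str(sample_rate),  # assumption: enum values are sent as strings
        "speed": speed,
        "temperature": temperature,
        "timestamp_type": timestamp_type,
    }
    # voice_id overrides voice when set, so only one of the two is sent.
    if voice_id:
        payload["voice_id"] = voice_id
    else:
        payload["voice"] = voice
    return payload
```

For example, `build_payload("Olá!", voice_id="Maitê", timestamp_type="WORD")` yields a dict that can be passed to `json.dumps`, while an out-of-range `speed` raises `ValueError` before any network traffic happens.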

Notes

Limits

  • Max input: 2,000 characters per request (chunk longer text at sentence boundaries)
  • WebSocket: 20 concurrent connections, 5 contexts per connection
  • Per-WebSocket message: 1,000 characters
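
Chunking at sentence boundaries, as the input limit above recommends, could look like the following sketch. The function `chunk_text` is hypothetical; it treats `.`, `!`, or `?` followed by whitespace as a sentence end (a simplification that will mis-split abbreviations) and hard-splits any single sentence longer than the limit.

```python
import re
from typing import List

MAX_CHARS = 2000  # per-request input limit


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split `text` into chunks of at most `max_chars`, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can be sent as its own request; passing `max_chars=1000` gives chunks that fit the per-message WebSocket limit instead.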

Latency

  • p90 TTFB: under 130 ms (Inworld benchmark)
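
To compare the benchmark figure with what a given client actually observes, time to first byte can be sampled locally and summarized as a p90. The sketch below is generic: `measure_ttfb_ms` and `p90` are hypothetical helpers, `open_stream` stands in for whatever function starts a streaming request and yields audio chunks, and the percentile uses the nearest-rank method, which may differ from how the vendor benchmark was computed.

```python
import time
from typing import Callable, Iterable, List


def measure_ttfb_ms(open_stream: Callable[[], Iterable[bytes]]) -> float:
    """Milliseconds from starting the stream until its first chunk arrives.

    If the stream yields nothing, this is the time until it ended.
    """
    start = time.perf_counter()
    for _ in open_stream():
        break  # only the first chunk matters for TTFB
    return (time.perf_counter() - start) * 1000.0


def p90(samples: List[float]) -> float:
    """Nearest-rank 90th percentile of `samples`."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (9 * len(ordered) + 9) // 10  # integer form of ceil(0.9 * n)
    return ordered[rank - 1]
```

Collecting a few dozen `measure_ttfb_ms` samples and passing them to `p90` gives a client-side number that includes network latency, so it will generally be higher than a server-side benchmark.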

Voices

  • 271+ named presets across 15 languages
  • 20 hand-picked presets exposed in the dropdown; pass any other voice ID via voice_id

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/tts-1-5-mini.