TTS 1.5 Max | EmpirioLabs AI Docs

Inworld · Audio Generation

POST /v1/audio/speech

Broadcast-quality voice synthesis with rich expressive prosody, 271+ voices across 15 languages, and real-time SSE streaming with per-word timestamps.

At a glance

Field	Value
Model id	`tts-1-5-max`
Model release date	2026-01-21
Input modalities	Text
Output modalities	Audio
Context window	-
Weight precision	-
Features	multi_speaker, real_time, streaming, word_timestamps, character_timestamps, multilingual, expressive_prosody, broadcast_quality
Native inference	No
New	No
Supported endpoints	`POST /v1/audio/speech`, `POST /v1/audio/speech:stream`, `GET /v1/voices`

Pricing

Charge	Spec	Rate
Synthesis	per 1M characters	$29.75 (was $35.00)
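
As a rough illustration of the per-character rate above, the charge for a piece of text can be estimated on the client. This is a minimal sketch: the function name is hypothetical, and it assumes every character of `input` (including whitespace) is billed, which this page does not state.

```python
from decimal import Decimal

# Rate copied from the pricing table: USD per 1,000,000 characters.
RATE_PER_MILLION_CHARS = Decimal("29.75")


def estimate_synthesis_cost(text: str) -> Decimal:
    """Estimate the synthesis charge in USD for one piece of text."""
    # Assumption: billing counts every character, including whitespace.
    return RATE_PER_MILLION_CHARS * len(text) / Decimal(1_000_000)
```

Under that assumption, a full 2,000-character request comes to roughly six cents.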

Example request

curl https://api.empiriolabs.ai/v1/audio/speech \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tts-1-5-max", "input": "Hello from EmpirioLabs."}'
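
The same request could be issued from Python using only the standard library. This is a sketch rather than an official client: the helper names are hypothetical, and because this page does not describe the response body, `synthesize` simply returns the raw bytes it receives.

```python
import json
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/audio/speech"


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"model": "tts-1-5-max", "input": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text: str, api_key: str) -> bytes:
    """Send the request and return the raw response body.

    Assumption: the body is returned unparsed because its shape
    (raw audio vs. a JSON envelope) is not specified on this page.
    """
    request = build_speech_request(text, api_key)
    with urllib.request.urlopen(request) as response:
        return response.read()
```

A typical call would be `synthesize("Hello from EmpirioLabs.", os.environ["EMPIRIOLABS_API_KEY"])`, with the key read from the environment instead of being hard-coded.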

Parameters

Parameter	Type	Required	Default	Description
`input`	string	yes	-	Text to synthesize. Max 2,000 characters per request; chunk longer copy at sentence boundaries on the client. · Max: 2000
`voice`	enum	no	`"Sarah"`	Voice preset. 20 hand-picked voices covering English, Spanish, Portuguese, and Hindi, plus various accents. For the full catalog of 271+ voices (including cloned voices), use `voice_id` instead. · Allowed: `Sarah`, `Olivia`, `Elizabeth`, `Ashley`, `Wendy`, `Julia`, `Priya`, `Pixie`, `Deborah`, `Alex`, `Mark`, `Edward`, `Theodore`, `Ronald`, `Dennis`, `Timothy`, `Shaun`, `Craig`, `Hades`, `Heitor`
`voice_id`	string	no	-	Free-form voice ID. Overrides voice when set. Use this to address voices outside the curated 20-preset list. Inworld TTS 1.5 ships 271+ named voices across 15 languages (regional accents, gendered variants). Example: Maitê, Olivia, or any voice name from GET /v1/voices.
`language`	enum	no	`"en-US"`	BCP-47 language tag. Inworld TTS 1.5 covers 15 languages, exposed here as 18 regional locales. · Allowed: `en-US`, `en-GB`, `es-ES`, `es-MX`, `fr-FR`, `de-DE`, `it-IT`, `pt-BR`, `pt-PT`, `nl-NL`, `pl-PL`, `ru-RU`, `ja-JP`, `ko-KR`, `zh-CN`, `hi-IN`, `ar-EG`, `he-IL`
`output_format`	enum	no	`"WAV"`	Audio container/codec. WAV = LINEAR16 inside RIFF (ubiquitous). MP3 / OGG = compressed. PCM = headerless raw, useful for chunked real-time playback. FLAC = lossless. · Allowed: `MP3`, `WAV`, `OGG`, `FLAC`, `PCM`, `ALAW`, `MULAW`
`sample_rate`	enum	no	`"24000"`	Output sample rate in Hz. 24000 is Inworld’s default and the rate its voice models are trained at; raise to 48000 for broadcast quality. · Allowed: `8000`, `16000`, `22050`, `24000`, `32000`, `44100`, `48000`
`speed`	number	no	`1.0`	Speaking rate multiplier. 0.5 = half speed, 1.5 = 50% faster. · Range: 0.5 – 1.5
`temperature`	number	no	`1.0`	Voice expressiveness / variability. Lower = more consistent / “flat”; higher = more expressive but more variation between renders. · Range: 0.1 – 2.0
`bit_rate`	number	no	`128000`	Bitrate in bps for MP3 and OGG (Opus) output. Ignored for other formats. · Range: 32000 – 320000
`apply_text_normalization`	enum	no	`"ON"`	When ON, Inworld expands numbers / abbreviations / dates into spoken form (“USD 5” → “five US dollars”). · Allowed: `ON`, `OFF`
`timestamp_type`	enum	no	`"NONE"`	If non-NONE, the response includes per-word or per-character timestamps in timestamp_info. Useful for caption / highlight UIs. · Allowed: `NONE`, `WORD`, `CHARACTER`
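
The ranges and allowed values in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: `build_payload` is a hypothetical helper, it covers only a subset of the parameters, and it assumes the 2,000-character limit is counted the way Python's `len` counts characters.

```python
MAX_INPUT_CHARS = 2000
OUTPUT_FORMATS = {"MP3", "WAV", "OGG", "FLAC", "PCM", "ALAW", "MULAW"}
TIMESTAMP_TYPES = {"NONE", "WORD", "CHARACTER"}
# Inclusive (low, high) bounds copied from the parameter table.
NUMERIC_RANGES = {
    "speed": (0.5, 1.5),
    "temperature": (0.1, 2.0),
    "bit_rate": (32000, 320000),
}


def build_payload(text: str, **options) -> dict:
    """Validate a subset of the documented parameters and return a JSON-ready dict."""
    if not text:
        raise ValueError("input is required")
    # Assumption: the server counts characters the same way len() does.
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input exceeds {MAX_INPUT_CHARS} characters")
    for name, (low, high) in NUMERIC_RANGES.items():
        if name in options and not low <= options[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")
    if options.get("output_format", "WAV") not in OUTPUT_FORMATS:
        raise ValueError("unsupported output_format")
    if options.get("timestamp_type", "NONE") not in TIMESTAMP_TYPES:
        raise ValueError("unsupported timestamp_type")
    return {"model": "tts-1-5-max", "input": text, **options}
```

For example, `build_payload("Hello", speed=1.2, output_format="MP3")` returns a dict ready to be serialized as the request body, while `speed=2.0` raises a `ValueError` before any network call is made.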

Notes

Limits

Max input: 2,000 characters per request (chunk longer text at sentence boundaries)
WebSocket: 20 concurrent connections, 5 contexts/connection
Per-WS message: 1,000 characters
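
Chunking longer text at sentence boundaries, as the limits above suggest, might look like the sketch below. The function name and the sentence-splitting rule (split after `.`, `!`, or `?` followed by whitespace) are assumptions for illustration; real copy with abbreviations or non-Latin punctuation may need a proper sentence tokenizer.

```python
import re


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split as a fallback.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

The same helper can be called with `max_chars=1000` to respect the per-message WebSocket limit.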

Latency

p90 TTFB: under 250 ms (Inworld benchmark)

Voices

271+ named presets across 15 languages
20 hand-picked presets exposed in the dropdown; pass any other voice ID via voice_id
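
Choosing between `voice` and `voice_id` can be handled with a small lookup. This sketch uses a hypothetical helper name; the preset list is copied from the `voice` parameter above, and any other name is passed through as `voice_id` without checking it against `GET /v1/voices`.

```python
# The 20 curated presets listed under the `voice` parameter.
PRESET_VOICES = frozenset({
    "Sarah", "Olivia", "Elizabeth", "Ashley", "Wendy", "Julia", "Priya",
    "Pixie", "Deborah", "Alex", "Mark", "Edward", "Theodore", "Ronald",
    "Dennis", "Timothy", "Shaun", "Craig", "Hades", "Heitor",
})


def voice_fields(name: str) -> dict[str, str]:
    """Return the request field that addresses the given voice name."""
    if name in PRESET_VOICES:
        return {"voice": name}
    # Not a curated preset: send it as a free-form voice ID (not validated here).
    return {"voice_id": name}
```

The returned dict can be merged into the request body, for example `{"model": "tts-1-5-max", "input": text, **voice_fields("Maitê")}`.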

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/tts-1-5-max.