input | string | yes | — | Text to synthesize. Max 2,000 characters per request — chunk longer copy at sentence boundaries on the client. · Max: 2000 |
voice | enum | no | "Sarah" | Voice preset. 20 hand-picked voices covering English, Spanish, Portuguese, and Hindi, plus various accents. For the full catalog of 271+ voices (including cloned voices), use voice_id instead. · Allowed: Sarah, Olivia, Elizabeth, Ashley, Wendy, Julia, Priya, Pixie, Deborah, Alex, Mark, Edward, Theodore, Ronald, Dennis, Timothy, Shaun, Craig, Hades, Heitor |
voice_id | string | no | — | Free-form voice ID. Overrides voice when set. Use this to address voices outside the curated 20-preset list — Inworld TTS 1.5 ships 271+ named voices across 15 languages (regional accents, gendered variants). Example: Maitê, Olivia, or any voice name from GET /v1/voices. |
language | enum | no | "en-US" | BCP 47 language tag. Inworld TTS 1.5 covers 15 languages; the 18 codes below include regional variants for English, Spanish, and Portuguese. · Allowed: en-US, en-GB, es-ES, es-MX, fr-FR, de-DE, it-IT, pt-BR, pt-PT, nl-NL, pl-PL, ru-RU, ja-JP, ko-KR, zh-CN, hi-IN, ar-EG, he-IL |
output_format | enum | no | "WAV" | Audio container/codec. WAV = LINEAR16 inside a RIFF container (widely supported). MP3 / OGG = compressed. PCM = headerless raw samples, useful for chunked real-time playback. FLAC = lossless. ALAW / MULAW = G.711 companded audio, typical for telephony. · Allowed: MP3, WAV, OGG, FLAC, PCM, ALAW, MULAW |
sample_rate | enum | no | "24000" | Output sample rate in Hz. 24000 is Inworld’s default and the rate its voice models are trained at; raise to 48000 for broadcast quality. · Allowed: 8000, 16000, 22050, 24000, 32000, 44100, 48000 |
speed | number | no | 1.0 | Speaking rate multiplier. 0.5 = half speed, 1.5 = 50% faster. · Range: 0.5 – 1.5 |
temperature | number | no | 1.0 | Voice expressiveness / variability. Lower = more consistent / “flat”; higher = more expressive but more variation between renders. · Range: 0.1 – 2.0 |
bit_rate | number | no | 128000 | Bitrate in bits per second for the MP3 and OGG (Opus) values of output_format. Ignored for other formats. · Range: 32000 – 320000 |
apply_text_normalization | enum | no | "ON" | When ON, Inworld expands numbers / abbreviations / dates into spoken form (“USD 5” → “five US dollars”). · Allowed: ON, OFF |
timestamp_type | enum | no | "NONE" | When set to WORD or CHARACTER, the response includes per-word or per-character timestamps in timestamp_info. Useful for caption / highlight UIs. · Allowed: NONE, WORD, CHARACTER |
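
The `input` row asks clients to chunk text longer than 2,000 characters at sentence boundaries. A minimal sketch of one way to do that in Python follows. The helper name `chunk_text` and the punctuation-based sentence splitter are illustrative assumptions, not part of the API; a sentence that is itself longer than the limit is hard-split.

```python
import re

MAX_CHARS = 2000  # per-request limit on `input`, taken from the table


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: breaks after '.', '!' or '?' followed by
    whitespace, and packs as many whole sentences per chunk as fit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit has to be hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The sentence does not fit; close the current chunk first.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as the `input` of its own request, and the resulting audio segments concatenated in order on the client.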
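
The allowed values and ranges in the table can also be checked on the client before a request is sent. The sketch below only builds and validates a request body; the function name `build_tts_request` is hypothetical, the transport and endpoint are left out because the table does not specify them, and whether the server rejects or clamps out-of-range values is not stated here, so failing early is an assumption.

```python
# Limits copied from the parameter table; the function name is hypothetical.
MAX_INPUT_CHARS = 2000
OUTPUT_FORMATS = {"MP3", "WAV", "OGG", "FLAC", "PCM", "ALAW", "MULAW"}
SAMPLE_RATES = {"8000", "16000", "22050", "24000", "32000", "44100", "48000"}
RANGES = {
    "speed": (0.5, 1.5),
    "temperature": (0.1, 2.0),
    "bit_rate": (32000, 320000),
}


def build_tts_request(text, voice_id=None, **options):
    """Return a request body dict, raising ValueError on invalid values."""
    if not text or len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"input must be 1-{MAX_INPUT_CHARS} characters")

    # Parameters the caller omits are left out so the documented
    # defaults apply.
    body = {"input": text, **options}
    if voice_id is not None:
        body["voice_id"] = voice_id  # overrides `voice` when set

    fmt = body.get("output_format", "WAV")
    if fmt not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {fmt}")

    rate = str(body.get("sample_rate", "24000"))
    if rate not in SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {rate}")

    for name, (low, high) in RANGES.items():
        if name in body and not low <= body[name] <= high:
            raise ValueError(f"{name} must be between {low} and {high}")

    return body
```

For example, `build_tts_request("Hello.", voice_id="Maitê", speed=1.2, output_format="MP3")` returns a dict holding just those four fields, while `speed=2.0` raises `ValueError` because it falls outside the 0.5 to 1.5 range.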