input | string | yes | - | Text to synthesize. For multi-speaker use [S1] / [S2] tags or ‘Speaker N:’ lines. |
voice | enum | no | "emma" | emma=English Female, james=US Male, arthur=US Male alt, xiaomei=Chinese Female, zhigang=Chinese Male, custom=upload reference via voice_audio_url. · Allowed: emma, james, arthur, xiaomei, zhigang, custom |
voice_audio_url | string | no | - | Reference audio URL for custom voice cloning. The reference recording must contain the speaker reading this exact consent phrase aloud, in their own voice: “I consent to Empirio Labs cloning my voice for the purpose of generating synthetic speech. I understand that my voice sample will be used to create personalized audio content.” Reference audio without the phrase is rejected. |
output_format | enum | no | "mp3" | Output audio file format. · Allowed: mp3, wav |
speed | number | no | 1.0 | Speaking rate multiplier. · Range: 0.5 – 2.0 |
model_quality | enum | no | "quality" | quality=FP16 (better), fast=INT8 (quicker) · Allowed: quality, fast |
sample_rate | enum | no | "24000" | Output sample rate in Hz. · Allowed: 24000, 16000 |
volume | number | no | 1.0 | Output gain multiplier. · Range: 0.1 – 2.0 |
use_cache | boolean | no | true | Cache results to speed up repeated identical generations. |
optimize_input | boolean | no | true | Auto-fix pronunciation of technical terms, acronyms, and special characters. |
seed | number | no | - | Seed for reproducible generation. |
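
The parameters above can be checked on the client before a request is sent. The following is a minimal sketch, not the service's own client: the helper name `build_tts_payload` is hypothetical, and it assumes the parameters are sent as a JSON object, that `sample_rate` is passed as a string (the table lists its default as `"24000"`), and that `voice_audio_url` is needed whenever `voice` is `custom`. No endpoint URL or authentication scheme is shown because the table does not state one.

```python
import json

# Allowed values and ranges copied from the parameter table.
ALLOWED_VOICES = {"emma", "james", "arthur", "xiaomei", "zhigang", "custom"}
ALLOWED_FORMATS = {"mp3", "wav"}
ALLOWED_QUALITY = {"quality", "fast"}
ALLOWED_SAMPLE_RATES = {"24000", "16000"}  # assumption: sent as strings


def build_tts_payload(
    input_text,
    voice="emma",
    voice_audio_url=None,
    output_format="mp3",
    speed=1.0,
    model_quality="quality",
    sample_rate="24000",
    volume=1.0,
    use_cache=True,
    optimize_input=True,
    seed=None,
):
    """Hypothetical helper: validate arguments and return a request dict."""
    if not isinstance(input_text, str) or not input_text.strip():
        raise ValueError("input is required and must be non-empty text")
    if voice not in ALLOWED_VOICES:
        raise ValueError(f"voice must be one of {sorted(ALLOWED_VOICES)}")
    # Assumption: a custom voice cannot work without reference audio.
    if voice == "custom" and not voice_audio_url:
        raise ValueError("voice='custom' needs voice_audio_url")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError("output_format must be 'mp3' or 'wav'")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if model_quality not in ALLOWED_QUALITY:
        raise ValueError("model_quality must be 'quality' or 'fast'")
    if str(sample_rate) not in ALLOWED_SAMPLE_RATES:
        raise ValueError("sample_rate must be 24000 or 16000")
    if not 0.1 <= volume <= 2.0:
        raise ValueError("volume must be between 0.1 and 2.0")

    payload = {
        "input": input_text,
        "voice": voice,
        "output_format": output_format,
        "speed": speed,
        "model_quality": model_quality,
        "sample_rate": str(sample_rate),
        "volume": volume,
        "use_cache": use_cache,
        "optimize_input": optimize_input,
    }
    # Optional parameters without a default are omitted when unset.
    if voice_audio_url is not None:
        payload["voice_audio_url"] = voice_audio_url
    if seed is not None:
        payload["seed"] = seed
    return payload


if __name__ == "__main__":
    # Two-speaker input using the [S1] / [S2] tags from the table.
    example = build_tts_payload(
        "[S1] Hello there. [S2] Hi, how are you?",
        voice="james",
        speed=1.25,
        seed=42,
    )
    print(json.dumps(example, indent=2))
```

Validating locally surfaces out-of-range values such as `speed=3.0` as a `ValueError` before any network call is made. The server remains the authority on what it accepts, so these checks only mirror the documented ranges.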