GLM TTS

Z.ai · Audio Generation
POST /v1/audio/speech

LLM-based text-to-speech model with zero-shot voice cloning from 3-10 seconds of reference audio and emotion-expressive, controllable output obtained through multi-reward reinforcement learning (RL).

At a glance

Field | Value
Model id | glm-tts
Input modalities | Text, Audio
Output modalities | Audio
Context window | -
Weight precision | INT8 / FP16
Features | voice_cloning, emotion_control
Native inference | Yes
New | No
Supported endpoints | POST /v1/audio/speech

Pricing

Charge | Spec | Rate
Fast (INT8) | per 1k characters | $0.20
Quality (FP16) | per 1k characters | $0.21
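
As a rough illustration of the rates above, the following Python sketch estimates the cost of a request from its character count. The name `estimate_cost` is hypothetical, and the sketch assumes billing is strictly proportional to the character count; this page does not say whether usage is rounded up to whole 1k-character blocks.

```python
from decimal import Decimal

# Rates in USD per 1,000 characters, taken from the pricing table above.
RATE_PER_1K = {
    "fast": Decimal("0.20"),     # INT8
    "quality": Decimal("0.21"),  # FP16
}


def estimate_cost(num_characters: int, model_quality: str = "quality") -> Decimal:
    """Estimate the USD cost of synthesizing `num_characters` characters.

    Assumption: cost scales linearly with characters (no rounding to 1k blocks).
    """
    if num_characters < 0:
        raise ValueError("num_characters must be non-negative")
    if model_quality not in RATE_PER_1K:
        raise ValueError(f"unknown model_quality: {model_quality!r}")
    return RATE_PER_1K[model_quality] * Decimal(num_characters) / Decimal(1000)
```

Under that assumption, a maximum-length input of 5,000 characters would come to about $1.00 on the fast tier and $1.05 on the quality tier.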

Example request

$ curl https://api.empiriolabs.ai/v1/audio/speech \
> -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
> -H 'Content-Type: application/json' \
> -d '{"model": "glm-tts", "input": "Hello from EmpirioLabs."}'
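
The same request can be sketched in Python using only the standard library. The function names `build_speech_request` and `synthesize` are hypothetical, and because this page does not describe the response body, the sketch returns the raw bytes without interpreting them.

```python
import json
import os
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/audio/speech"


def build_speech_request(text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"model": "glm-tts", "input": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text: str) -> bytes:
    """Send the request and return the raw response body.

    The response format is not described on this page, so the bytes are
    returned as-is for the caller to interpret.
    """
    request = build_speech_request(text, os.environ["EMPIRIOLABS_API_KEY"])
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Keeping request construction separate from sending makes it possible to inspect the headers and body without making a network call.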

Parameters

Parameter | Type | Required | Default | Description
input | string | yes | - | Text to synthesize. For multi-speaker input, use [S1] / [S2] tags or ‘Speaker N:’ lines.
voice | enum | no | "emma" | emma=English Female, james=US Male, arthur=US Male alt, xiaomei=Chinese Female, zhigang=Chinese Male, custom=upload reference via voice_audio_url. · Allowed: emma, james, arthur, xiaomei, zhigang, custom
voice_audio_url | string | no | - | Reference audio URL for custom voice cloning. The reference recording must contain the speaker reading this exact consent phrase aloud, in their own voice: “I consent to Empirio Labs cloning my voice for the purpose of generating synthetic speech. I understand that my voice sample will be used to create personalized audio content.” Reference audio without the phrase is rejected.
output_format | enum | no | "mp3" | Output audio file format. · Allowed: mp3, wav
speed | number | no | 1.0 | Speaking rate multiplier. · Range: 0.5 – 2.0
model_quality | enum | no | "quality" | quality=FP16 (better), fast=INT8 (quicker) · Allowed: quality, fast
sample_rate | enum | no | "24000" | Output sample rate in Hz. · Allowed: 24000, 16000
volume | number | no | 1.0 | Output gain multiplier. · Range: 0.1 – 2.0
use_cache | boolean | no | true | Speeds up repeated identical generations.
optimize_input | boolean | no | true | Auto-fix pronunciation of technical terms, acronyms, and special characters.
seed | number | no | - | Reproducibility seed.
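
The allowed values and ranges in the table above can be checked on the client before a request is sent. The following Python sketch is one way to do that; `build_payload` is a hypothetical helper, it covers only a subset of the parameters, and the rule that `voice="custom"` requires `voice_audio_url` is an assumption inferred from the parameter descriptions.

```python
ALLOWED_VOICES = {"emma", "james", "arthur", "xiaomei", "zhigang", "custom"}
ALLOWED_FORMATS = {"mp3", "wav"}
ALLOWED_QUALITIES = {"quality", "fast"}


def build_payload(text, voice="emma", output_format="mp3", speed=1.0,
                  volume=1.0, model_quality="quality", voice_audio_url=None):
    """Return a request body dict, raising ValueError on invalid values."""
    if not text:
        raise ValueError("input text is required")
    if voice not in ALLOWED_VOICES:
        raise ValueError(f"voice must be one of {sorted(ALLOWED_VOICES)}")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {sorted(ALLOWED_FORMATS)}")
    if model_quality not in ALLOWED_QUALITIES:
        raise ValueError(f"model_quality must be one of {sorted(ALLOWED_QUALITIES)}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if not 0.1 <= volume <= 2.0:
        raise ValueError("volume must be between 0.1 and 2.0")
    # Assumption: a custom voice is only usable together with a reference URL.
    if voice == "custom" and not voice_audio_url:
        raise ValueError("voice='custom' requires voice_audio_url")

    payload = {
        "model": "glm-tts",
        "input": text,
        "voice": voice,
        "output_format": output_format,
        "speed": speed,
        "volume": volume,
        "model_quality": model_quality,
    }
    if voice_audio_url:
        payload["voice_audio_url"] = voice_audio_url
    return payload
```

For example, `build_payload("Hi", speed=1.5)` returns a body with the default voice, while `build_payload("Hi", speed=3.0)` raises `ValueError` before any request is made.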

Notes

Limits

  • Max input: 5,000 characters
  • Generation: 5-10 minutes
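
Text longer than the 5,000-character limit has to be sent as several requests. The Python sketch below shows one possible way to split it, preferring sentence boundaries and falling back to a hard cut for oversized sentences. The helper name `split_text` and the splitting strategy are illustrative assumptions, not something this page prescribes.

```python
import re

MAX_INPUT_CHARS = 5000  # "Max input" limit listed above


def split_text(text: str, limit: int = MAX_INPUT_CHARS) -> list[str]:
    """Split text into chunks of at most `limit` characters.

    Sentences (ended by '.', '!' or '?' followed by whitespace) are kept
    together where possible and rejoined with a single space.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    chunks = []
    current = ""
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        # Hard-split any single sentence that is longer than the limit.
        pieces = [sentence[i:i + limit] for i in range(0, len(sentence), limit)]
        for piece in pieces:
            candidate = f"{current} {piece}" if current else piece
            if len(candidate) <= limit:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be submitted as the `input` of its own request, and the resulting audio segments concatenated in order.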

Voice cloning

  • Reference audio: 3-10 seconds
  • Accepted formats: WAV, MP3, OGG, FLAC, AAC, M4A, WebM
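
Before uploading a reference clip, its length can be checked locally against the 3-10 second window. The Python sketch below does this for WAV files only, using the standard-library `wave` module; the other accepted formats would need a third-party decoder. The helper names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
import wave

MIN_SECONDS = 3.0
MAX_SECONDS = 10.0


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file given a path or binary file object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference_length(source) -> bool:
    """Check that a WAV clip falls within the 3-10 second reference window."""
    return MIN_SECONDS <= wav_duration_seconds(source) <= MAX_SECONDS
```

This only checks duration; it cannot confirm that the recording contains the required consent phrase.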

Preset voices

  • emma (English F)
  • james (US M)
  • arthur (UK M)
  • xiaomei (Chinese F)
  • zhigang (Chinese M)

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/glm-tts.