Stable Audio 2.0 | EmpirioLabs AI Docs

Stability AI · Audio Generation

POST /v1/audio/generations

Generates up to 190 seconds (3 minutes 10 seconds) of audio from a text prompt, in either text-to-audio or audio-to-audio mode, with adjustable duration, diffusion steps, and CFG scale.

At a glance

Field	Value
Model id	`stable-audio-2-0`
Model release date	2024-04-03
Input modalities	Text
Output modalities	Audio
Context window	-
Weight precision	-
Features	music_generation, text_to_audio, sound_effects
Native inference	No
New	No
Supported endpoints	`POST /v1/audio/generations`
Alternate model ids	`stability-audio-2.0`, `stability/audio-2.0`

Pricing

Charge	Spec	Rate
Base Cost	per generation	$0.58
Per Step Cost	per step	$0.00
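A rough budgeting sketch, assuming the charge for one generation is the Base Cost plus `steps` × Per Step Cost. That additive model and the helper name `estimate_cost` are assumptions for illustration, not documented billing behavior; the two rates are copied from the table above.

```shell
# Rates copied from the Pricing table; the additive formula below is an ASSUMPTION.
BASE_COST=0.58
PER_STEP_COST=0.00

# Hypothetical helper: print the estimated cost of one generation.
estimate_cost() {
  local steps=${1:-50}   # 50 is the documented default for `steps`
  awk -v base="$BASE_COST" -v rate="$PER_STEP_COST" -v n="$steps" \
    'BEGIN { printf "%.2f\n", base + n * rate }'
}

estimate_cost        # default 50 steps
estimate_cost 100    # maximum allowed steps
```

At the listed $0.00 per-step rate both calls print `0.58`, so the step count does not change the price; if the per-step rate changes, only `PER_STEP_COST` needs updating.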

Example request

curl https://api.empiriolabs.ai/v1/audio/generations \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "stable-audio-2-0", "prompt": "warm jazz piano", "duration": 8}'
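To reuse the request with different prompts, the body can be built from shell variables. This is a minimal sketch: the function names `json_escape`, `text_to_audio_body`, and `generate` are hypothetical, the escaping only covers backslashes and double quotes, and the response format is not documented on this page, so `generate` simply prints whatever the API returns. Note the double quotes around the Authorization header; inside single quotes the shell would send the literal text `$EMPIRIOLABS_API_KEY` instead of the key.

```shell
# Hypothetical helpers; only the endpoint, headers, and field names come from this page.

# Minimal JSON string escaping: backslashes first, then double quotes.
# Control characters (newlines, tabs) are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print a text-to-audio request body; duration defaults to the documented 190 s.
text_to_audio_body() {
  local prompt=$1 duration=${2:-190}
  printf '{"model": "stable-audio-2-0", "prompt": "%s", "duration": %s}\n' \
    "$(json_escape "$prompt")" "$duration"
}

# POST the body. Double quotes let $EMPIRIOLABS_API_KEY expand.
generate() {
  curl https://api.empiriolabs.ai/v1/audio/generations \
    -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
    -H 'Content-Type: application/json' \
    -d "$(text_to_audio_body "$@")"
}
```

`text_to_audio_body 'warm jazz piano' 8` prints the same JSON body as the curl example, and `generate 'warm jazz piano' 8` sends it.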

Parameters

Parameter	Type	Required	Default	Description
`prompt`	string	yes	-	What to generate. Be specific about genre, instruments, mood, and tempo.
`mode`	enum	no	`"text-to-audio"`	text-to-audio: generate from prompt only. audio-to-audio: condition on a reference clip. · Allowed: `text-to-audio`, `audio-to-audio`
`output_format`	enum	no	`"mp3"`	Audio file format of the generated output. · Allowed: `mp3`, `wav`
`duration`	number	no	`190`	Length of the output in seconds. Stable Audio 2.0 generates up to 190 seconds (3 minutes 10 seconds). · Range: 1 – 190
`steps`	number	no	`50`	Diffusion steps. More = higher fidelity, slower. Each step is billed at the Per Step Cost rate ($0.00 per step under Pricing). · Range: 30 – 100
`cfg_scale`	number	no	`7`	Classifier-free guidance. Higher = follows prompt more strictly. · Range: 1 – 25
`strength`	number	no	`1`	Audio-to-audio only. 0 = ignore reference, 1 = stay close to reference. · Range: 0 – 1
`random_seed`	boolean	no	`true`	If `true`, a new random seed is used on each call.
`seed`	number	no	-	Reproducibility seed. Only used when `random_seed` is `false`.
`audio_url`	string	no	-	Reference audio URL. Required in audio-to-audio mode.
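Because out-of-range values waste a paid request, a client can check them first. The sketch below hard-codes the ranges from the parameter table; the helper names `in_range` and `validate_params` are hypothetical, and it assumes the server enforces the same inclusive bounds.

```shell
# in_range VALUE MIN MAX: exit status 0 if MIN <= VALUE <= MAX (awk handles decimals).
in_range() {
  awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(v + 0 >= lo && v + 0 <= hi) }'
}

# Hypothetical pre-flight check using the documented ranges.
# Arguments: duration steps cfg_scale [strength, default 1]
validate_params() {
  local duration=$1 steps=$2 cfg=$3 strength=${4:-1}
  in_range "$duration" 1 190  || { echo "duration must be 1-190 s" >&2; return 1; }
  in_range "$steps"    30 100 || { echo "steps must be 30-100" >&2; return 1; }
  in_range "$cfg"      1 25   || { echo "cfg_scale must be 1-25" >&2; return 1; }
  in_range "$strength" 0 1    || { echo "strength must be 0-1" >&2; return 1; }
}
```

For example, `validate_params 8 50 7` succeeds, while `validate_params 8 8 7` fails because 8 steps is below the minimum of 30.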

Notes

Generates up to 190 seconds of audio, either from a text prompt alone or by transforming a reference clip (audio-to-audio).

Audio-to-audio mode

Requires BOTH a `prompt` and a reference clip passed as `audio_url`
Recommended CFG scale: 7-15
Recommended steps: within the accepted 30-100 range (default 50)
Typical strength: 0.3-0.7
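An audio-to-audio request body can be assembled the same way. In this sketch the helper name `audio_to_audio_body` is hypothetical, and the fallback values (`strength` 0.5, `cfg_scale` 7) are picked from the recommended ranges above rather than being API defaults; the function refuses to build a body unless both the prompt and the reference URL are present.

```shell
# Minimal JSON string escaping (backslashes, then double quotes only).
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: audio_to_audio_body PROMPT AUDIO_URL [STRENGTH] [CFG_SCALE]
audio_to_audio_body() {
  local prompt=$1 audio_url=$2 strength=${3:-0.5} cfg=${4:-7}
  # Audio-to-audio needs both a prompt and a reference clip.
  if [ -z "$prompt" ] || [ -z "$audio_url" ]; then
    echo "audio-to-audio needs both a prompt and an audio_url" >&2
    return 1
  fi
  printf '{"model": "stable-audio-2-0", "mode": "audio-to-audio", "prompt": "%s", "audio_url": "%s", "strength": %s, "cfg_scale": %s}\n' \
    "$(json_escape "$prompt")" "$(json_escape "$audio_url")" "$strength" "$cfg"
}
```

The printed body can be passed to curl with `-d`, exactly as in the text-to-audio example.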

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/stable-audio-2-0.