MOSS Video and Audio | EmpirioLabs AI Docs

OpenMOSS · Video Generation

POST /v1/videos/generations

Open-source 32B MoE foundation model that generates synchronized video and audio in a single inference pass, with precise dual-tower lip-sync.

At a glance

Field	Value
Model id	`moss-video-and-audio`
Model release date	2026-01-29
Input modalities	Text, Image
Output modalities	Video, Audio
Context window	-
Weight precision	-
Features	audio_sync, lipsync
Native inference	Yes
New	No
Supported endpoints	`POST /v1/videos/generations`
Alternate model ids	`moss-video-audio`, `openmoss/video-audio`

Pricing

Charge	Spec	Rate
360p Video	per video	$0.17
720p Video	per video	$2.82
T2V Fast	additional fee	$0.065
T2V Quality	additional fee	$0.13
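
To make the table easier to reason about, here is a rough cost estimator. It is only a sketch under two assumptions that this page does not confirm: the "additional fee" is added on top of the per-video rate, and it applies only to text-to-video requests (no surcharge is listed for i2v). The function name `estimate_cost` is hypothetical and not part of the API.

```python
from decimal import Decimal

# Rates copied from the pricing table (USD).
BASE_RATE = {"360p": Decimal("0.17"), "720p": Decimal("2.82")}
T2V_FEE = {"fast": Decimal("0.065"), "quality": Decimal("0.13")}


def estimate_cost(resolution="720p", mode="t2v", t2v_quality="quality"):
    """Estimate the price of one video in USD.

    ASSUMPTION: the T2V fee is added to the per-video rate, and only
    when mode is "t2v". The pricing table does not spell this out.
    """
    cost = BASE_RATE[resolution]
    if mode == "t2v":
        cost += T2V_FEE[t2v_quality]
    return cost
```

Under those assumptions, `estimate_cost("360p", "t2v", "fast")` adds 0.17 and 0.065. `Decimal` is used so the sums stay exact rather than picking up binary floating-point error.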

Example request

$ curl https://api.empiriolabs.ai/v1/videos/generations \
>   -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
>   -H 'Content-Type: application/json' \
>   -d '{"model": "moss-video-and-audio", "prompt": "sunrise over the ocean", "duration": 6}'
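
The same request can be assembled in Python with the standard library alone. This is a minimal sketch, not an official client: the helper names `build_generation_request` and `send_request` are hypothetical, the 30-minute timeout is an assumption chosen because generation can take 20+ minutes, and the response body is returned as raw text because its schema is not described on this page.

```python
import json
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/videos/generations"


def build_generation_request(prompt, api_key, **params):
    """Build (but do not send) a POST request for the generations endpoint."""
    payload = {"model": "moss-video-and-audio", "prompt": prompt, **params}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_request(request, timeout_seconds=1800):
    """Send the request and return the raw response body as text.

    ASSUMPTION: a long timeout is used because generation can be slow;
    the response format is not documented here, so it is not parsed.
    """
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return response.read().decode("utf-8")
```

A call such as `send_request(build_generation_request("sunrise over the ocean", api_key, duration=6))` would mirror the curl example, with `api_key` read from the `EMPIRIOLABS_API_KEY` environment variable.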

Parameters

Parameter	Type	Required	Default	Description
`prompt`	string	yes	-	Scene description. With an image attached, it serves as the image-to-video prompt.
`mode`	enum	no	`"t2v"`	t2v: pure text-to-video. i2v: animate the attached image. · Allowed: `t2v`, `i2v`
`resolution`	enum	no	`"720p"`	720p uses a separate higher-VRAM endpoint. · Allowed: `360p`, `720p`
`aspect_ratio`	enum	no	`"landscape"`	MOSS only supports landscape (16:9) and portrait (9:16). · Allowed: `landscape`, `portrait`
`duration`	number	no	`8`	Clip length in seconds. The upstream model is hard-capped at 8s. · Range: 2 – 8
`t2v_quality`	enum	no	`"quality"`	Text-to-video only. fast trades fidelity for ~2× speed. · Allowed: `fast`, `quality`
`num_inference_steps`	number	no	`25`	Diffusion steps. More = higher fidelity, slower. · Range: 10 – 50
`cfg_scale`	number	no	`5.0`	Classifier-free guidance. Higher = follows prompt more strictly. · Range: 1.0 – 10.0
`sigma_shift`	number	no	`5.0`	Schedule shift. Only valid when resolution=360p. · Range: 1.0 – 10.0
`image`	string	no	-	Reference image URL for i2v mode.
`negative_prompt`	string	no	`""`	What to avoid.
`seed`	number	no	-	Reproducibility seed.
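
Because a rejected request may only surface after a long wait, it can help to check a payload against the table above before sending it. The sketch below is a hypothetical client-side helper (`validate_params` is not part of the API). It encodes the documented enums, ranges, and the `sigma_shift` and `t2v_quality` conditions. Requiring `image` in `i2v` mode is an assumption inferred from the parameter descriptions, not a documented rule, and the server may apply additional checks.

```python
# Allowed values and ranges copied from the parameters table.
ENUMS = {
    "mode": ("t2v", "i2v"),
    "resolution": ("360p", "720p"),
    "aspect_ratio": ("landscape", "portrait"),
    "t2v_quality": ("fast", "quality"),
}
RANGES = {
    "duration": (2, 8),
    "num_inference_steps": (10, 50),
    "cfg_scale": (1.0, 10.0),
    "sigma_shift": (1.0, 10.0),
}


def validate_params(params):
    """Return a list of problems found in a request payload (empty if none)."""
    errors = []
    if not isinstance(params.get("prompt"), str):
        errors.append("prompt is required and must be a string")
    for name, allowed in ENUMS.items():
        if name in params and params[name] not in allowed:
            errors.append(f"{name} must be one of {allowed}")
    for name, (low, high) in RANGES.items():
        if name in params:
            value = params[name]
            if not isinstance(value, (int, float)) or not low <= value <= high:
                errors.append(f"{name} must be a number in [{low}, {high}]")
    # Fall back to the documented defaults when a field is omitted.
    mode = params.get("mode", "t2v")
    resolution = params.get("resolution", "720p")
    if "sigma_shift" in params and resolution != "360p":
        errors.append("sigma_shift is only valid when resolution is 360p")
    if "t2v_quality" in params and mode != "t2v":
        errors.append("t2v_quality applies to text-to-video only")
    if mode == "i2v" and not params.get("image"):
        # ASSUMPTION: the table does not state that image is mandatory for i2v.
        errors.append("i2v mode expects an image URL")
    return errors
```

Note that, with the documented default of `720p`, passing `sigma_shift` without also setting `resolution` to `360p` is flagged.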

Notes

The model is a 32B-parameter MoE that produces lip-synced video and audio in a single inference pass.

Constraints

Generation can take 20+ minutes
Image-to-video typically yields better results than text-to-video
Only 1 image is supported (used as the first frame)
Video inputs are not supported

Image formats

jpg, jpeg, png, webp, heic, heif, bmp, tiff, tif, gif
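
A quick pre-flight check on the `image` URL can catch unsupported formats before a long generation is queued. The following is a sketch only: `has_supported_extension` is a hypothetical helper, it inspects just the file extension in the URL path (not the actual file contents), and treating extensions as case-insensitive is an assumption.

```python
import posixpath
from urllib.parse import urlparse

# Extensions copied from the list of supported image formats.
SUPPORTED_EXTENSIONS = {
    "jpg", "jpeg", "png", "webp", "heic", "heif", "bmp", "tiff", "tif", "gif",
}


def has_supported_extension(image_url):
    """Return True if the URL path ends in a listed image extension."""
    path = urlparse(image_url).path  # drops any query string or fragment
    extension = posixpath.splitext(path)[1].lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS
```

URLs without a file extension are reported as unsupported by this check even though the server might still accept them, so treat a `False` result as a warning rather than a hard failure.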

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/moss-video-and-audio.