MOSS Video and Audio

OpenMOSS · Video Generation
POST /v1/videos/generations

Open-source 32B MoE foundation model that generates synchronized video and audio in one inference step with precise dual-tower lip-sync.

At a glance

Field                Value
Model id             moss-video-and-audio
Input modalities     Text, Image
Output modalities    Video, Audio
Context window       -
Weight precision     -
Features             audio_sync, lipsync
Native inference     Yes
New                  No
Supported endpoints  POST /v1/videos/generations

Pricing

Charge       Spec            Rate
360p Video   per video       $0.17
720p Video   per video       $2.82
T2V Fast     additional fee  $0.065
T2V Quality  additional fee  $0.13
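
As a rough illustration of how these rates might combine, the sketch below adds the T2V fee once on top of the per-video rate. That combination rule is an assumption (the table does not spell it out), as is the choice to charge no extra fee for i2v requests. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Rates copied from the pricing table above.
BASE_RATE = {"360p": Decimal("0.17"), "720p": Decimal("2.82")}
T2V_FEE = {"fast": Decimal("0.065"), "quality": Decimal("0.13")}


def estimate_cost(resolution: str = "720p", mode: str = "t2v",
                  t2v_quality: str = "quality") -> Decimal:
    """Estimate the price of one video in USD.

    Assumption: the T2V fee is added once per video, and only in t2v mode.
    """
    cost = BASE_RATE[resolution]
    if mode == "t2v":
        cost += T2V_FEE[t2v_quality]
    return cost


print(estimate_cost())                       # 720p, t2v, quality
print(estimate_cost("360p", "t2v", "fast"))  # cheapest t2v combination
```

Decimal is used instead of float so that the three-decimal fee sums exactly.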

Example request

curl https://api.empiriolabs.ai/v1/videos/generations \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "moss-video-and-audio", "prompt": "sunrise over the ocean", "duration": 6}'

Parameters

Parameter            Type    Required  Default      Description
prompt               string  yes       -            Scene description. With image attached, becomes an image-to-video prompt.
mode                 enum    no        "t2v"        t2v: pure text-to-video. i2v: animate the attached image. · Allowed: t2v, i2v
resolution           enum    no        "720p"       720p uses a separate higher-VRAM endpoint. · Allowed: 360p, 720p
aspect_ratio         enum    no        "landscape"  MOSS only supports landscape (16:9) and portrait (9:16). · Allowed: landscape, portrait
duration             number  no        8            Clip length in seconds. The upstream model is hard-capped at 8s. · Range: 2 – 8
t2v_quality          enum    no        "quality"    Text-to-video only. fast trades fidelity for ~2× speed. · Allowed: fast, quality
num_inference_steps  number  no        25           Diffusion steps. More = higher fidelity, slower. · Range: 10 – 50
cfg_scale            number  no        5.0          Classifier-free guidance. Higher = follows prompt more strictly. · Range: 1.0 – 10.0
sigma_shift          number  no        5.0          Schedule shift. Only valid when resolution=360p. · Range: 1.0 – 10.0
image                string  no        -            Reference image URL for i2v mode.
negative_prompt      string  no        ""           What to avoid.
seed                 number  no        -            Reproducibility seed.
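
Because a rejected request can cost time on a slow endpoint, it may help to check a payload locally first. The following is a minimal client-side sketch of the ranges and allowed values listed above. The function name `validate_params` is hypothetical, how the server actually reports violations is not documented here, and the rule that i2v requires an `image` is an assumption inferred from the parameter descriptions.

```python
# Allowed values and numeric ranges copied from the parameter table.
ENUMS = {
    "mode": {"t2v", "i2v"},
    "resolution": {"360p", "720p"},
    "aspect_ratio": {"landscape", "portrait"},
    "t2v_quality": {"fast", "quality"},
}
RANGES = {
    "duration": (2, 8),
    "num_inference_steps": (10, 50),
    "cfg_scale": (1.0, 10.0),
    "sigma_shift": (1.0, 10.0),
}


def validate_params(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if not params.get("prompt"):
        errors.append("prompt is required")
    for name, allowed in ENUMS.items():
        if name in params and params[name] not in allowed:
            errors.append(f"{name} must be one of {sorted(allowed)}")
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            errors.append(f"{name} must be between {low} and {high}")

    # Cross-field rules, using the documented defaults for omitted fields.
    mode = params.get("mode", "t2v")
    resolution = params.get("resolution", "720p")
    if "sigma_shift" in params and resolution != "360p":
        errors.append("sigma_shift is only valid when resolution is 360p")
    if "t2v_quality" in params and mode != "t2v":
        errors.append("t2v_quality applies to text-to-video only")
    if mode == "i2v" and not params.get("image"):
        # Assumption: i2v needs an image, although the table marks it optional.
        errors.append("i2v mode needs an image URL")
    return errors


print(validate_params({"prompt": "sunrise over the ocean", "duration": 12}))
```

Note that the default resolution is 720p, so a request that sets `sigma_shift` must also set `resolution` to 360p to pass this check.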

Notes

A 32B-parameter MoE model that produces lip-synced video and audio in a single inference pass.

Constraints

  • Generation can take 20+ minutes
  • Image-to-video typically yields better results than text-to-video
  • Only 1 image is supported (used as the first frame)
  • Video inputs are NOT supported
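
Given the generation time noted above, a client with a default timeout of a minute or so would likely give up too early. Here is a minimal Python sketch of the same request as the curl example, using only the standard library. It assumes the call blocks until the job finishes, which this page does not confirm. The 45-minute timeout is an arbitrary illustrative value, and the helper names are hypothetical. The response body is returned as parsed JSON without assuming any particular fields.

```python
import json
import os
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/videos/generations"
ASSUMED_TIMEOUT_SECONDS = 45 * 60  # assumption: generous margin over 20+ minutes


def build_generation_request(payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request for one generation job."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def generate(payload: dict, api_key: str,
             timeout: float = ASSUMED_TIMEOUT_SECONDS) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_generation_request(payload, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_generation_request(
    {"model": "moss-video-and-audio", "prompt": "sunrise over the ocean", "duration": 6},
    api_key=os.environ.get("EMPIRIOLABS_API_KEY", ""),
)
print(request.get_method(), request.full_url)

# To actually submit the job (this performs a network call and may run long):
# result = generate(json.loads(request.data), os.environ["EMPIRIOLABS_API_KEY"])
```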

Image formats

  • jpg, jpeg, png, webp, heic, heif, bmp, tiff, tif, gif
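
For image-to-video requests, a quick local check of the image URL's file extension against the list above can catch obvious mistakes before a long-running job is submitted. This is only a sketch: the helper names are hypothetical, and it assumes the format can be read from the URL path. A URL without an extension fails this check even though the service may still accept it.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions copied from the list of supported image formats.
SUPPORTED_EXTENSIONS = {
    "jpg", "jpeg", "png", "webp", "heic", "heif", "bmp", "tiff", "tif", "gif",
}


def has_supported_extension(image_url: str) -> bool:
    """True if the URL path ends in one of the listed extensions."""
    path = urlparse(image_url).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS


def build_i2v_payload(prompt: str, image_url: str) -> dict:
    """Build an image-to-video request body with a single first-frame image."""
    if not has_supported_extension(image_url):
        raise ValueError(f"unsupported image extension: {image_url}")
    return {
        "model": "moss-video-and-audio",
        "mode": "i2v",
        "prompt": prompt,
        "image": image_url,
    }


payload = build_i2v_payload("the woman turns and smiles",
                            "https://example.com/frames/portrait.PNG?v=2")
print(payload["mode"], payload["image"])
```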

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/moss-video-and-audio.
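
A minimal sketch of retrieving that schema from Python follows. Whether the endpoint requires the Authorization header is not stated here, so the sketch sends it only when a key is supplied. The helper names are hypothetical, and no assumptions are made about the fields in the returned JSON.

```python
import json
import urllib.request
from typing import Optional

SCHEMA_URL = "https://api.empiriolabs.ai/v1/models/moss-video-and-audio"


def build_schema_request(api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) the GET request for the model schema."""
    headers = {"Accept": "application/json"}
    if api_key:
        # Assumption: the same bearer scheme as the generation endpoint.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(SCHEMA_URL, headers=headers, method="GET")


def fetch_schema(api_key: Optional[str] = None, timeout: float = 30.0) -> dict:
    """Send the request and return the decoded JSON document."""
    request = build_schema_request(api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage (performs a network call):
# schema = fetch_schema()
# print(json.dumps(schema, indent=2))
```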