Tongyi-Embedding-Vision-Plus

Alibaba Cloud · Embedding
POST /v1/embeddings

Multimodal embedding model that produces independent vectors for text, image, and video inputs. Use this when each content element needs its own embedding (e.g. matching a caption against a set of images).

At a glance

| Field | Value |
| --- | --- |
| Model id | tongyi-embedding-vision-plus |
| Input modalities | text, image, video |
| Output modalities | embedding |
| Context window | 1024 tokens |
| Region | Singapore |
| Features | multimodal, independent vectors |
| New | Yes |
| Native inference | No |
| Supported endpoints | POST /v1/embeddings |

Pricing

| Charge | Spec | Rate |
| --- | --- | --- |
| Text input | per 1M tokens | $0.09 |
| Image / video input | per 1M tokens | $0.09 |
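Since both input types share the listed $0.09 per 1M tokens rate, cost estimation is a single multiplication. A minimal sketch (the helper name is illustrative, not part of the API):

```python
# Back-of-envelope cost estimate at the listed rate of $0.09 per 1M input tokens.
RATE_PER_MILLION_USD = 0.09

def embedding_cost(tokens: int, rate: float = RATE_PER_MILLION_USD) -> float:
    """USD cost for embedding `tokens` input tokens at the flat per-token rate."""
    return tokens / 1_000_000 * rate

print(round(embedding_cost(500_000), 4))   # 500k text tokens -> 0.045
print(round(embedding_cost(2_000_000), 4)) # 2M image/video tokens -> 0.18
```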

Example request

curl https://api.empiriolabs.ai/v1/embeddings \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tongyi-embedding-vision-plus", "input": [{"type": "text", "text": "Hello"}]}'

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input | array | yes | — | Array of content parts. Either the OpenAI shape `[{"type":"image","url":"..."},{"type":"text","text":"..."}]` or the DashScope shape `{"contents":[{"image":"..."},{"text":"..."}]}`. Up to 8 images at 3 MB each, video up to 10 MB, text up to 1,024 tokens. |
| user | string | no | — | |
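The two accepted `input` shapes from the table can be sketched as plain payload dicts; nothing here sends a request, and the URLs are hypothetical placeholders:

```python
# OpenAI-style `input`: a flat array of typed content parts.
openai_style = {
    "model": "tongyi-embedding-vision-plus",
    "input": [
        {"type": "image", "url": "https://example.com/cat.jpg"},  # hypothetical URL
        {"type": "text", "text": "a photo of a cat"},
    ],
}

# DashScope-style `input`: a single object with a `contents` list,
# each element keyed by its modality.
dashscope_style = {
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [
            {"image": "https://example.com/cat.jpg"},  # hypothetical URL
            {"text": "a photo of a cat"},
        ]
    },
}
```

Either shape would be POSTed as the JSON body to `/v1/embeddings`; each element yields its own vector.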

Notes

Embedding dimension: fixed at 1152.

Per-input limits:

- Text: up to 1,024 tokens
- Images: up to 8 per request, max 3 MB each (JPG, PNG, BMP)
- Video: up to 10 MB per file (MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV)

Output: independent vector per input element (no fusion).

Languages: Chinese and English.
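Because each element gets an independent vector, the caption-vs-images use case from the intro reduces to cosine similarity over the returned vectors. A minimal sketch with toy 4-dimensional stand-ins for the real 1152-dimensional embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model output; real vectors have 1152 dimensions.
caption = [0.9, 0.1, 0.0, 0.1]
images = {
    "cat.jpg": [0.8, 0.2, 0.1, 0.0],
    "car.jpg": [0.0, 0.1, 0.9, 0.3],
}

# Pick the image whose embedding is closest to the caption embedding.
best = max(images, key=lambda name: cosine(caption, images[name]))
print(best)  # cat.jpg
```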


Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/tongyi-embedding-vision-plus.