Tongyi-Embedding-Vision-Plus

Alibaba Cloud · Embedding
POST /v1/embeddings

Multimodal embedding model that produces independent vectors for text, image, and video inputs. Use this when each content element needs its own embedding (e.g. matching a caption against a set of images).

At a glance

| Field | Value |
| --- | --- |
| Model id | tongyi-embedding-vision-plus |
| Input modalities | text, image, video |
| Output modalities | embedding |
| Context window | 1024 tokens |
| Region | Singapore |
| Features | multimodal, independent vectors |
| New | Yes |
| Native inference | No |
| Supported endpoints | POST /v1/embeddings |

Pricing

| Charge | Spec | Rate |
| --- | --- | --- |
| Text input | per 1M tokens | $0.09 |
| Image / video input | per 1M tokens | $0.09 |
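Since both input types share the listed $0.09 per 1M tokens rate, cost estimation is a single multiplication. A minimal sketch (the helper name is illustrative, not part of the API):

```python
# Back-of-envelope cost estimate at the listed rate of $0.09 per 1M input tokens.
RATE_PER_MILLION_USD = 0.09

def embedding_cost(tokens: int, rate: float = RATE_PER_MILLION_USD) -> float:
    """USD cost for embedding `tokens` input tokens at the flat per-token rate."""
    return tokens / 1_000_000 * rate

print(round(embedding_cost(500_000), 4))   # 500k text tokens -> 0.045
print(round(embedding_cost(2_000_000), 4)) # 2M image/video tokens -> 0.18
```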

Example request

curl https://api.empiriolabs.ai/v1/embeddings \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tongyi-embedding-vision-plus", "input": [{"type": "text", "text": "Hello"}]}'

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| input | array | yes | — | Array of content parts. Either the OpenAI shape `[{"type":"image","url":"..."},{"type":"text","text":"..."}]` or the DashScope shape `{"contents":[{"image":"..."},{"text":"..."}]}`. Up to 8 images at 3 MB each, video up to 10 MB, text up to 1,024 tokens. |
| user | string | no | — | |
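The two accepted `input` shapes from the table can be sketched as plain payload dicts; nothing here sends a request, and the URLs are hypothetical placeholders:

```python
# OpenAI-style `input`: a flat array of typed content parts.
openai_style = {
    "model": "tongyi-embedding-vision-plus",
    "input": [
        {"type": "image", "url": "https://example.com/cat.jpg"},  # hypothetical URL
        {"type": "text", "text": "a photo of a cat"},
    ],
}

# DashScope-style `input`: a single object with a `contents` list,
# each element keyed by its modality.
dashscope_style = {
    "model": "tongyi-embedding-vision-plus",
    "input": {
        "contents": [
            {"image": "https://example.com/cat.jpg"},  # hypothetical URL
            {"text": "a photo of a cat"},
        ]
    },
}
```

Either shape would be POSTed as the JSON body to `/v1/embeddings`; each element yields its own vector.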

Notes

Embedding dimension: fixed at 1152.

Per-input limits:

- Text: up to 1,024 tokens
- Images: up to 8 per request, max 3 MB each (JPG, PNG, BMP)
- Video: up to 10 MB per file (MP4, MPEG, MOV, MPG, WEBM, AVI, FLV, MKV)

Output: independent vector per input element (no fusion).

Languages: Chinese and English.
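Because each element gets an independent vector, the caption-vs-images use case from the intro reduces to cosine similarity over the returned vectors. A minimal sketch with toy 4-dimensional stand-ins for the real 1152-dimensional embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model output; real vectors have 1152 dimensions.
caption = [0.9, 0.1, 0.0, 0.1]
images = {
    "cat.jpg": [0.8, 0.2, 0.1, 0.0],
    "car.jpg": [0.0, 0.1, 0.9, 0.3],
}

# Pick the image whose embedding is closest to the caption embedding.
best = max(images, key=lambda name: cosine(caption, images[name]))
print(best)  # cat.jpg
```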


Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/tongyi-embedding-vision-plus.