MiMo V2.5

Xiaomi · Text Generation
POST /v1/chat/completions

Multimodal model with native visual and audio understanding and a 1M-token context window, designed to reason and act across modalities in agentic workflows.

At a glance

| Field | Value |
| --- | --- |
| Model id | mimo-v2-5 |
| Input modalities | Text, Image, Video, Audio |
| Output modalities | Text |
| Context window | 1M |
| Weight precision | - |
| Max output tokens | 128,000 |
| Features | vision, audio_in |
| Native inference | No |
| New | Yes |
| Supported endpoints | POST /v1/chat/completions, POST /v1/responses, POST /v1/messages |

Pricing

| Charge | Spec | Rate |
| --- | --- | --- |
| Input | per 1M prompt tokens | $0.70 |
| Output | per 1M generated tokens | $1.40 |
| Implicit cache read | per 1M cached input tokens | $0.014 |
| Web Search | per call | $0.015 |
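
The rates above can be turned into a rough per-request estimate. The sketch below is illustrative only: the function name `estimate_cost_usd` is hypothetical, and it assumes that cached tokens are a subset of the prompt tokens and are billed at the cache-read rate instead of the input rate, which this page does not state explicitly.

```python
# Rates copied from the pricing table (USD).
INPUT_PER_M = 0.70         # per 1M prompt tokens
OUTPUT_PER_M = 1.40        # per 1M generated tokens
CACHE_READ_PER_M = 0.014   # per 1M cached input tokens
WEB_SEARCH_PER_CALL = 0.015


def estimate_cost_usd(prompt_tokens, completion_tokens, cached_tokens=0, web_searches=0):
    """Estimate the cost of one request in USD.

    Assumption: cached_tokens is counted inside prompt_tokens and is billed
    at the cache-read rate rather than the regular input rate.
    """
    if cached_tokens > prompt_tokens:
        raise ValueError("cached_tokens cannot exceed prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    return (
        uncached_tokens * INPUT_PER_M / 1_000_000
        + cached_tokens * CACHE_READ_PER_M / 1_000_000
        + completion_tokens * OUTPUT_PER_M / 1_000_000
        + web_searches * WEB_SEARCH_PER_CALL
    )
```

For billing that matters, the `cost_usd` value returned by the API is the figure to rely on; an estimate like this is only useful for budgeting before a request is sent.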

Example request

curl https://api.empiriolabs.ai/v1/chat/completions \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "mimo-v2-5", "messages": [{"role":"user","content":"Hello"}]}'
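
The same request can be expressed in Python with only the standard library. This is a minimal sketch: `build_request` and `send` are hypothetical helper names, and the shape of the response body is not documented on this page, so `send` just returns the decoded JSON.

```python
import json
import os
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/chat/completions"


def build_request(prompt, api_key):
    """Build (but do not send) a chat completion request for mimo-v2-5."""
    body = json.dumps(
        {"model": "mimo-v2-5", "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call would then look like `send(build_request("Hello", os.environ["EMPIRIOLABS_API_KEY"]))`, with the key read from the same environment variable the curl example uses.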

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| enable_thinking | boolean | no | true | Enable extended thinking mode. Slower but improves reasoning-heavy tasks. |
| tool_web_search | boolean | no | false | Allow the model to perform web searches when needed. |
| web_search_force | boolean | no | false | Force the model to always run a web search before answering. |
| web_search_max_keyword | number | no | 3 | Max number of keywords the model can use across web searches. Range: 1 – 5 |
| web_search_limit | number | no | 5 | Max number of web searches the model can perform per request. Range: 1 – 10 |
| video_fps | number | no | 2 | Frames-per-second sampled from input video for analysis. Range: 0.1 – 10 |
| video_resolution | enum | no | "default" | Resolution at which input video is sampled (e.g. 360p, 480p, 720p). Allowed: default, max |
| temperature | number | no | 0.7 | Sampling temperature. 0 = deterministic, 2 = maximum randomness. Range: 0 – 2 |
| top_p | number | no | 0.9 | Nucleus sampling probability mass. Lower = more focused. Range: 0 – 1 |
| max_tokens | number | no | 4096 | Maximum tokens in the response. Range: 1 – 65536 |
| stop | string | no | - | Up to 4 strings where the model will stop generating further tokens. |
| disable_formatting | boolean | no | false | Skip the EmpirioLabs Markdown formatting (citation [N] rewriting + References block when web search was used). The raw upstream answer with plain [N] citations is returned. |
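
The ranges and allowed values in the table can be checked client-side before a request is sent. The sketch below is one way to do that; `build_payload` is a hypothetical helper, and it assumes the parameters are passed as top-level fields of the JSON body and that `stop` accepts either a single string or a list of up to 4 strings, neither of which this page confirms.

```python
# (min, max) ranges copied from the parameter table.
RANGES = {
    "web_search_max_keyword": (1, 5),
    "web_search_limit": (1, 10),
    "video_fps": (0.1, 10),
    "temperature": (0, 2),
    "top_p": (0, 1),
    "max_tokens": (1, 65536),
}
ALLOWED = {"video_resolution": {"default", "max"}}
BOOLEANS = {"enable_thinking", "tool_web_search", "web_search_force", "disable_formatting"}


def build_payload(prompt, **params):
    """Return a request body, rejecting values outside the documented limits.

    Assumption: optional parameters sit at the top level of the JSON body.
    """
    payload = {
        "model": "mimo-v2-5",
        "messages": [{"role": "user", "content": prompt}],
    }
    for name, value in params.items():
        if name in RANGES:
            low, high = RANGES[name]
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or not low <= value <= high:
                raise ValueError(f"{name} must be a number in [{low}, {high}]")
        elif name in ALLOWED:
            if value not in ALLOWED[name]:
                raise ValueError(f"{name} must be one of {sorted(ALLOWED[name])}")
        elif name in BOOLEANS:
            if not isinstance(value, bool):
                raise ValueError(f"{name} must be a boolean")
        elif name == "stop":
            # Assumption: a single string or a list of at most 4 strings.
            stops = [value] if isinstance(value, str) else list(value)
            if len(stops) > 4 or not all(isinstance(s, str) for s in stops):
                raise ValueError("stop must be at most 4 strings")
        else:
            raise ValueError(f"unknown parameter: {name}")
        payload[name] = value
    return payload
```

For example, `build_payload("Summarize today's news", tool_web_search=True, web_search_limit=3)` returns a body with web search enabled and capped at three searches, while `temperature=3` raises a `ValueError` before any network call is made.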

Notes

Omnimodal input (text, image, video, audio) with text output. Web search ($0.015/call) is charged only when invoked. Cached input tokens are billed at $0.014 per 1M, which is 2% of the standard input rate.

Per-tool billing (usage.tool_usage)

When this model invokes tools (web search, code interpreter, etc.) inside a single request, the response carries a normalized usage.tool_usage map alongside the token counts. The example below shows the shape; exact field names, units, and which tools appear can vary slightly per provider:

"usage": {
  "prompt_tokens": 123,
  "completion_tokens": 456,
  "cost_usd": 0.0042,
  "tool_usage": {"web_search": 3, "code_interpreter": 1}
}

The tool counts are already factored into cost_usd; they are surfaced for transparency so you can audit per-tool billing. The field is omitted when no tools were invoked.
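
A client can read this map defensively, since the field is absent when no tools ran. The following sketch uses hypothetical helper names and assumes the `web_search` value is a count of calls billed at the $0.015 per-call rate from the pricing table. Units may differ per provider, so treat the result as an audit aid rather than an authoritative charge.

```python
WEB_SEARCH_PER_CALL = 0.015  # USD per call, from the pricing table


def tool_counts(usage):
    """Return the tool_usage map, or an empty dict when the field is omitted."""
    return dict(usage.get("tool_usage") or {})


def web_search_charge(usage):
    """Estimate the web search portion of cost_usd.

    Assumption: tool_usage["web_search"] is a number of billed calls.
    """
    return tool_counts(usage).get("web_search", 0) * WEB_SEARCH_PER_CALL
```

Because the tool charges are already included in `cost_usd`, the value returned by `web_search_charge` should be read as a share of that total, not added on top of it.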


Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/mimo-v2-5.