Qwen3.5 4B | EmpirioLabs AI Docs

POST /v1/chat/completions

Qwen3.5 4B is a low-cost multimodal reasoning model with 256K context, image and video input, function tools, and structured output.

At a glance

Field	Value
Model id	`qwen3-5-4b`
Model release date	2026-03-02
Input modalities	Text, Image, Video
Output modalities	Text
Context window	256K
Weight precision	FP8 weights + FP8 KV cache
Max output tokens	32,768
Features	reasoning, vision, video, function_calling, cache, multimodal, json_mode, logprobs
Native inference	Yes
New	Yes
Structured output	JSON Schema
Supported endpoints	`POST /v1/chat/completions`, `POST /v1/responses`, `POST /v1/messages`, `POST /v1/completions`, `POST /v1beta/models/qwen3-5-4b:generateContent`
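
The table lists image input but does not show a message format. The sketch below builds a user message that pairs text with an inline image. It assumes `POST /v1/chat/completions` accepts OpenAI-style content parts (`text` and `image_url` with a base64 data URL); this page does not confirm that shape, and the helper name `image_message` is hypothetical.

```python
import base64
import json


def image_message(prompt: str, image_bytes: bytes, mime_type: str = "image/png") -> dict:
    """Build a user message that pairs a text prompt with an inline image.

    Assumption: the endpoint accepts OpenAI-style content parts with a
    base64 data URL. The model page does not confirm this shape.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime_type};base64,{encoded}"},
            },
        ],
    }


# Placeholder bytes stand in for a real file read with open(path, "rb").
payload = {
    "model": "qwen3-5-4b",
    "messages": [image_message("Describe this image.", b"placeholder image bytes")],
}
print(json.dumps(payload, indent=2))
```

Inline data URLs keep the request self-contained. If the service also accepts remote image URLs, the `url` field would carry the link instead.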

Pricing

Charge	Spec	Rate
Input	per 1M prompt tokens	$0.04
Output	per 1M generated tokens	$0.07
Implicit cache read	per 1M cached input tokens	$0.02
Web search	per request when enabled	$0.013
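
As a worked example of the rates above, 100,000 prompt tokens and 2,000 generated tokens cost 100,000 × $0.04/1M + 2,000 × $0.07/1M = $0.004 + $0.00014 = $0.00414. The sketch below does the same arithmetic in code. It assumes cached tokens are a subset of the prompt tokens and are billed at the cache-read rate instead of the input rate; the function name `estimate_cost` is illustrative and not part of the API.

```python
from decimal import Decimal

# Rates copied from the pricing table (USD per 1M tokens, or per request).
INPUT_PER_M = Decimal("0.04")
OUTPUT_PER_M = Decimal("0.07")
CACHE_READ_PER_M = Decimal("0.02")
WEB_SEARCH_PER_REQUEST = Decimal("0.013")
MILLION = Decimal(1_000_000)


def estimate_cost(prompt_tokens: int, output_tokens: int,
                  cached_tokens: int = 0, web_search: bool = False) -> Decimal:
    """Estimate the USD cost of one request.

    Assumption: cached_tokens is a subset of prompt_tokens and is billed
    at the cache-read rate instead of the input rate.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    cost = (uncached * INPUT_PER_M
            + cached_tokens * CACHE_READ_PER_M
            + output_tokens * OUTPUT_PER_M) / MILLION
    if web_search:
        cost += WEB_SEARCH_PER_REQUEST
    return cost


print(estimate_cost(100_000, 2_000))
print(estimate_cost(100_000, 2_000, cached_tokens=80_000, web_search=True))
```

`Decimal` keeps the sub-cent amounts exact, which matters when summing many small requests.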

Example request

$ curl https://api.empiriolabs.ai/v1/chat/completions \
>   -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
>   -H 'Content-Type: application/json' \
>   -d '{"model": "qwen3-5-4b", "messages": [{"role":"user","content":"Hello"}]}'
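
The same request can be sketched in Python using only the standard library. The URL, headers, and body mirror the curl example; the helper names and the assumption that the response follows the OpenAI chat completions shape (`choices[0].message.content`) are not confirmed by this page.

```python
import json
import os
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/chat/completions"


def build_chat_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl example."""
    payload = {
        "model": "qwen3-5-4b",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the assistant text.

    Assumption: the response body follows the OpenAI chat completions shape.
    """
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request needs no network access; sending it does.
request = build_chat_request(os.environ.get("EMPIRIOLABS_API_KEY", "missing-key"), "Hello")
print(request.get_method(), request.full_url)
```

With `EMPIRIOLABS_API_KEY` set, `print(send_chat_request(request))` performs the actual call.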

Parameters

Parameter	Type	Required	Default	Description
`temperature`	number	no	`0.7`	Sampling temperature. 0 is deterministic and 2 is maximum randomness. · Range: 0 – 2
`top_p`	number	no	`0.95`	Nucleus sampling probability mass. Lower values make outputs more focused. · Range: 0 – 1
`max_tokens`	integer	no	`4096`	Maximum output tokens. · Range: 1 – 32768
`stop`	string	no	-	Up to 4 strings where the model will stop generating further tokens.
`reasoning_effort`	enum	no	`"medium"`	Reasoning effort. `none` disables thinking; `low`, `medium`, `high`, and `max` set bounded thinking budgets. · Allowed: `none`, `low`, `medium`, `high`, `max`
`enable_thinking`	boolean	no	true	Enable the model reasoning channel before final output.
`thinking_budget`	integer	no	`4096`	Maximum thinking tokens before the final answer. If `max_tokens` is lower than this budget, the service reserves room for the answer. · Range: 1024 – 32768
`top_k`	integer	no	`20`	Limit sampling to the top K candidate tokens when supported. · Range: 1 – 200
`min_p`	number	no	`0`	Minimum probability threshold for token sampling. · Range: 0 – 1
`presence_penalty`	number	no	`0`	Penalty for tokens that already appeared in the generated text. · Range: -2 – 2
`frequency_penalty`	number	no	`0`	Penalty based on how often a token has already appeared. · Range: -2 – 2
`repetition_penalty`	number	no	`1`	Penalty used by SGLang to reduce repeated text. · Range: 0.1 – 2
`seed`	integer	no	-	Optional random seed for reproducible sampling. · Range: 0 – 2147483647
`logprobs`	boolean	no	false	Return token log probabilities when supported.
`top_logprobs`	integer	no	-	Return up to this many top token log probabilities. · Range: 0 – 20
`logit_bias`	object	no	-	Bias token IDs by adding positive or negative values before sampling.
`tools`	array	no	-	OpenAI-compatible function tool definitions.
`tool_choice`	object	no	-	OpenAI-compatible function tool selection.
`stream`	boolean	no	false	Stream response deltas using server-sent events.
`response_format`	enum	no	-	Constrain the output to JSON. Use JSON mode for any valid JSON object, or JSON schema to force output that matches a schema you provide.
`web_search_linkup`	boolean	no	false	Optional web search powered by Linkup. When enabled, recent web sources are retrieved using your latest user message as the query and passed to the model as additional context. Each call that invokes search adds $0.013 on top of the model's normal token cost.
`disable_formatting`	boolean	no	false	When enabled, the gateway will not append the “Sources” footer to assistant responses that used Linkup web search. Useful when the model output is piped to another system that expects no decoration.
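
The documented ranges can be checked on the client before a request is sent. The sketch below copies the numeric ranges and the `reasoning_effort` values from the table. The helper `validate_params` is hypothetical, and this page does not say whether the service rejects or clamps out-of-range values.

```python
# Numeric ranges copied from the parameter table: name -> (min, max).
RANGES = {
    "temperature": (0, 2),
    "top_p": (0, 1),
    "max_tokens": (1, 32768),
    "thinking_budget": (1024, 32768),
    "top_k": (1, 200),
    "min_p": (0, 1),
    "presence_penalty": (-2, 2),
    "frequency_penalty": (-2, 2),
    "repetition_penalty": (0.1, 2),
    "seed": (0, 2147483647),
    "top_logprobs": (0, 20),
}
REASONING_EFFORTS = {"none", "low", "medium", "high", "max"}


def validate_params(params: dict) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            problems.append(f"{name}={params[name]!r} is outside {low} to {high}")
    effort = params.get("reasoning_effort")
    if effort is not None and effort not in REASONING_EFFORTS:
        problems.append(f"reasoning_effort={effort!r} is not an allowed value")
    return problems


print(validate_params({"temperature": 0.2, "top_k": 20}))
print(validate_params({"temperature": 2.5, "reasoning_effort": "extreme"}))
```

Returning a list instead of raising on the first problem lets a caller report every invalid field at once.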

Notes

The model supports text, image, and video input, streaming, function tools, structured JSON output, and seed control. Thinking mode is on by default. Use `reasoning_effort` or `thinking_budget` to bound thinking, or set `enable_thinking` to `false` for direct answers. Automatic cache reads are billed at the cached-input rate when the model service reports them. Explicit cache controls are not supported.
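
The three ways of controlling thinking can be sketched as request bodies. This assumes the parameters are sent as top-level fields of the JSON body, alongside `model` and `messages`, which is how the parameter table presents them but is not stated explicitly. The helper `chat_payload` is hypothetical.

```python
import json


def chat_payload(prompt: str, **options) -> dict:
    """Build a chat completions body for qwen3-5-4b with extra top-level options."""
    payload = {
        "model": "qwen3-5-4b",
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(options)
    return payload


# Direct answer: turn the reasoning channel off.
direct = chat_payload("What is 17 * 23?", enable_thinking=False)

# Bounded thinking by named effort level.
bounded_by_effort = chat_payload("Plan a 3-step migration.", reasoning_effort="low")

# Bounded thinking by explicit token budget, with a fixed seed and
# max_tokens set above the budget so the answer has room.
bounded_by_budget = chat_payload(
    "Prove that the sum of two even numbers is even.",
    thinking_budget=2048,
    max_tokens=4096,
    seed=7,
)

print(json.dumps(bounded_by_budget, indent=2))
```

Each variant sets only one thinking control, because this page does not describe how `reasoning_effort`, `thinking_budget`, and `enable_thinking` interact when combined.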

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/qwen3-5-4b.