Qwen3.5 4B

Alibaba Cloud · Text Generation
POST /v1/chat/completions

Qwen3.5 4B is a low-cost multimodal reasoning model with 256K context, image and video input, function tools, and structured output.

At a glance

| Field | Value |
| --- | --- |
| Model id | qwen3-5-4b |
| Input modalities | Text, Image, Video |
| Output modalities | Text |
| Context window | 256K |
| Weight precision | FP8 weights + FP8 KV |
| Max output tokens | 32,768 |
| Features | reasoning, vision, video, function_calling, structured_output, cache, multimodal, json_mode, logprobs |
| Native inference | Yes |
| New | Yes |
| Supported endpoints | POST /v1/chat/completions, POST /v1/responses, POST /v1/messages, POST /v1/completions |

Pricing

| Charge | Spec | Rate |
| --- | --- | --- |
| Input | per 1M prompt tokens | $0.04 |
| Output | per 1M generated tokens | $0.07 |
| Implicit cache read | per 1M cached input tokens | $0.02 |
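
As a rough worked example of the rates above, the sketch below estimates the cost of one request. The function name `estimate_cost_usd` is illustrative. The sketch also assumes that cached tokens are a subset of the prompt tokens and are billed at the cache-read rate instead of the input rate, which this page does not spell out.

```python
# Rates copied from the pricing table, in USD per 1M tokens.
INPUT_RATE = 0.04
OUTPUT_RATE = 0.07
CACHE_READ_RATE = 0.02


def estimate_cost_usd(prompt_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the cost of one request in USD.

    Assumption: cached_tokens is the part of prompt_tokens served from the
    implicit cache, billed at the cache-read rate instead of the input rate.
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    total_per_million = (
        uncached_tokens * INPUT_RATE
        + cached_tokens * CACHE_READ_RATE
        + output_tokens * OUTPUT_RATE
    )
    return total_per_million / 1_000_000
```

Under these assumptions, a request with 100,000 prompt tokens (50,000 of them cached) and 2,000 output tokens comes to about $0.00314.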

Example request

curl https://api.empiriolabs.ai/v1/chat/completions \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3-5-4b", "messages": [{"role":"user","content":"Hello"}]}'
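
The same request can be sketched in Python using only the standard library. The helper names `build_request` and `send_request` are illustrative. The response shape is not documented on this page, so the sketch returns the parsed JSON as-is rather than assuming particular fields.

```python
import json
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/chat/completions"


def build_request(prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for qwen3-5-4b."""
    payload = {
        "model": "qwen3-5-4b",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_request(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON body unchanged."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A typical call would be `send_request(build_request("Hello", os.environ["EMPIRIOLABS_API_KEY"]))` after `import os`, which reads the key from the same environment variable as the curl example.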

Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| temperature | number | no | 0.7 | Sampling temperature. 0 is deterministic and 2 is maximum randomness. · Range: 0 – 2 |
| top_p | number | no | 0.95 | Nucleus sampling probability mass. Lower values make outputs more focused. · Range: 0 – 1 |
| max_tokens | integer | no | 4096 | Maximum output tokens. · Range: 1 – 32768 |
| stop | string | no | | Up to 4 strings where the model will stop generating further tokens. |
| reasoning_effort | enum | no | "medium" | Reasoning effort. none disables thinking; low, medium, high, and max set bounded thinking budgets. · Allowed: none, low, medium, high, max |
| enable_thinking | boolean | no | true | Enable the model reasoning channel before final output. |
| thinking_budget | integer | no | 4096 | Maximum thinking tokens before the final answer. If max_tokens is lower, the service reserves room for the answer. · Range: 1024 – 32768 |
| top_k | integer | no | 20 | Limit sampling to the top K candidate tokens when supported. · Range: 1 – 200 |
| min_p | number | no | 0 | Minimum probability threshold for token sampling. · Range: 0 – 1 |
| presence_penalty | number | no | 0 | Penalty for tokens that already appeared in the generated text. · Range: -2 – 2 |
| frequency_penalty | number | no | 0 | Penalty based on how often a token has already appeared. · Range: -2 – 2 |
| repetition_penalty | number | no | 1 | Penalty used by SGLang to reduce repeated text. · Range: 0.1 – 2 |
| seed | integer | no | | Optional random seed for reproducible sampling. · Range: 0 – 2147483647 |
| logprobs | boolean | no | false | Return token log probabilities when supported. |
| top_logprobs | integer | no | | Return up to this many top token log probabilities. · Range: 0 – 20 |
| logit_bias | object | no | | Bias token IDs by adding positive or negative values before sampling. |
| tools | array | no | | OpenAI-compatible function tool definitions. |
| tool_choice | object | no | | OpenAI-compatible function tool selection. |
| response_format | object | no | | Structured JSON output instructions. |
| stream | boolean | no | false | Stream response deltas using server-sent events. |
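
The documented ranges can be checked on the client before a request is sent. The sketch below is one possible way to do that. The helper name `validate_params` is illustrative, and the check on `stop` assumes it may be passed as a list of strings, which the table (type string, "up to 4 strings") does not state outright. The service's own validation remains authoritative.

```python
# Numeric ranges copied from the parameter table: name -> (minimum, maximum).
RANGES = {
    "temperature": (0, 2),
    "top_p": (0, 1),
    "max_tokens": (1, 32768),
    "thinking_budget": (1024, 32768),
    "top_k": (1, 200),
    "min_p": (0, 1),
    "presence_penalty": (-2, 2),
    "frequency_penalty": (-2, 2),
    "repetition_penalty": (0.1, 2),
    "seed": (0, 2147483647),
    "top_logprobs": (0, 20),
}
REASONING_EFFORTS = {"none", "low", "medium", "high", "max"}


def validate_params(params: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue found."""
    errors = []
    for name, (low, high) in RANGES.items():
        if name in params and not low <= params[name] <= high:
            errors.append(f"{name}={params[name]!r} is outside {low} to {high}")
    effort = params.get("reasoning_effort")
    if effort is not None and effort not in REASONING_EFFORTS:
        errors.append(f"reasoning_effort={effort!r} is not an allowed value")
    # Assumption: stop may be given as a list; the table only says "up to 4 strings".
    stop = params.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        errors.append("stop accepts at most 4 strings")
    return errors
```

For example, `validate_params({"temperature": 3})` reports one problem, while the documented defaults pass without any.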

Notes

The model supports text, image, and video input, streaming, function tools, structured JSON output, and seed control. Thinking mode is on by default: use reasoning_effort or thinking_budget to bound thinking, or set enable_thinking=false for direct answers. Automatic cache reads are billed at the cached-input rate when reported by the model service. Explicit cache controls are not supported.
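
The three thinking controls mentioned above can be illustrated as request bodies. This is a minimal sketch that assumes the controls are top-level fields of the JSON body, alongside `model` and `messages`. The helper name `chat_payload` is illustrative. Each payload sets only one control, since this page does not say how the controls interact when combined.

```python
def chat_payload(prompt: str, **options) -> dict:
    """Build a request body for qwen3-5-4b with optional top-level parameters."""
    payload = {
        "model": "qwen3-5-4b",
        "messages": [{"role": "user", "content": prompt}],
    }
    payload.update(options)
    return payload


# Direct answer: turn the reasoning channel off.
direct = chat_payload("What is 2 + 2?", enable_thinking=False)

# Bounded thinking by named effort level.
low_effort = chat_payload("Summarize this paragraph.", reasoning_effort="low")

# Bounded thinking by explicit token budget (documented range: 1024 to 32768).
budgeted = chat_payload("Plan a three-step migration.", thinking_budget=2048)
```

Any of these dictionaries can be serialized with `json.dumps` and sent as the body of a POST to /v1/chat/completions.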


Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/qwen3-5-4b.