GLM 5.1

Z.ai · Text Generation

POST /v1/chat/completions

Long-context Zhipu AI reasoning model with 202K context, 128K output, tool calling, structured output, and cache support.

At a glance

Field	Value
Model id	`glm-5-1`
Model release date	2026-04-07
Input modalities	Text
Output modalities	Text
Context window	202K
Weight precision	-
Region	China
Features	reasoning, function_calling, cache
Native inference	No
New	Yes
Structured output	JSON Schema
Supported endpoints	`POST /v1/chat/completions`, `POST /v1/responses`, `POST /v1/messages`, `POST /v1beta/models/glm-5-1:generateContent`

Pricing

Charge	Spec	Rate
Input	per 1M prompt tokens	<=32K $0.825 (was $1.40); 32K-200K $1.10 (was $1.40)
Output	per 1M generated tokens	<=32K $3.301 (was $4.40); 32K-200K $3.851 (was $4.40)
Implicit cache read	per 1M cached input tokens	<=32K $0.165 (was $0.26); 32K-200K $0.22 (was $0.26)
Web search	per request when enabled	$0.013
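As a rough illustration of how the tiered rates combine, the sketch below estimates the cost of a single request from the table above. It rests on several assumptions that this page does not confirm: that the tier is chosen by the request's prompt token count, that "32K" means 32,000 tokens, that cached prompt tokens are billed at the cache-read rate in place of the input rate, and that the web search fee is charged once per request. The function name `estimate_cost` is hypothetical.

```python
from decimal import Decimal

# Current rates in USD per 1M tokens, copied from the pricing table.
# The struck-through "was" prices are ignored.
RATES = {
    "short": {"input": Decimal("0.825"), "output": Decimal("3.301"), "cache_read": Decimal("0.165")},
    "long": {"input": Decimal("1.10"), "output": Decimal("3.851"), "cache_read": Decimal("0.22")},
}
WEB_SEARCH_PER_REQUEST = Decimal("0.013")

# ASSUMPTION: "32K" is read as 32,000 tokens and the tier is selected
# by the prompt length of the request.
TIER_BOUNDARY_TOKENS = 32_000


def estimate_cost(prompt_tokens, output_tokens, cached_tokens=0, web_search=False):
    """Estimate the USD cost of one request (hypothetical helper)."""
    if cached_tokens > prompt_tokens:
        raise ValueError("cached_tokens cannot exceed prompt_tokens")

    tier = "short" if prompt_tokens <= TIER_BOUNDARY_TOKENS else "long"
    rates = RATES[tier]

    # ASSUMPTION: cached prompt tokens are billed at the cache-read rate
    # instead of the regular input rate.
    uncached_tokens = prompt_tokens - cached_tokens
    token_cost = (
        uncached_tokens * rates["input"]
        + cached_tokens * rates["cache_read"]
        + output_tokens * rates["output"]
    ) / Decimal(1_000_000)

    if web_search:
        token_cost += WEB_SEARCH_PER_REQUEST
    return token_cost
```

Under these assumptions, `estimate_cost(10_000, 1_000)` evaluates to 0.011551 USD. `Decimal` is used so that the per-million rates do not pick up binary floating-point rounding.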

Example request

$ curl https://api.empiriolabs.ai/v1/chat/completions \
>   -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
>   -H 'Content-Type: application/json' \
>   -d '{"model": "glm-5-1", "messages": [{"role":"user","content":"Hello"}]}'
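The same request can be sketched in Python using only the standard library. The URL, headers, and body mirror the curl example; the helper names `build_chat_request` and `send` are hypothetical, and the response is returned as parsed JSON without assuming any particular shape, since this page does not document it.

```python
import json
import urllib.request

API_URL = "https://api.empiriolabs.ai/v1/chat/completions"


def build_chat_request(prompt, api_key, model="glm-5-1"):
    """Build (but do not send) a POST request matching the curl example."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical call would read the key from the `EMPIRIOLABS_API_KEY` environment variable, for example `send(build_chat_request("Hello", os.environ["EMPIRIOLABS_API_KEY"]))` after `import os`. Keeping construction separate from sending makes the request easy to inspect before any network traffic occurs.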

Parameters

Parameter	Type	Required	Default	Description
`max_tokens`	integer	no	`4096`	Maximum number of output tokens to generate. · Range: 1 – 128000
`temperature`	number	no	`1`	Controls randomness. Lower values make responses more deterministic. · Range: 0 – 2
`top_p`	number	no	`0.95`	Nucleus sampling cutoff. · Range: 0 – 1
`top_k`	integer	no	`20`	Limits sampling to the top K tokens. · Range: 1 – 100
`repetition_penalty`	number	no	`1`	Penalizes repeated tokens. · Range: 0.1 – 2
`reasoning_effort`	enum	no	`"medium"`	Reasoning effort level. `none` disables thinking; `low`, `medium`, `high`, and `max` set bounded thinking budgets sized to the selected model. Sent as an OpenAI-style `reasoning_effort` field and translated into `enable_thinking` and `thinking_budget` for the model service. · Allowed: `none`, `low`, `medium`, `high`, `max`
`enable_thinking`	boolean	no	true	Allow the model to reason before answering. Disable this for strict structured output.
`thinking_budget`	integer	no	`32768`	Maximum tokens available for reasoning content when thinking is enabled. · Range: 1 – 38912
`tool_stream`	boolean	no	false	Stream function-call arguments incrementally when streaming.
`tools`	array	no	`[]`	OpenAI-compatible function calling tool definitions.
`tool_choice`	object	no	-	OpenAI-compatible tool choice control.
`parallel_tool_calls`	boolean	no	true	Allow multiple tool calls in a single assistant turn when supported.
`stop`	array	no	-	Optional stop sequences.
`response_format`	enum	no	-	Constrain the output to JSON. Use JSON mode for any valid JSON object, or JSON schema to force output that matches a schema you provide.
`web_search_linkup`	boolean	no	false	Optional web search powered by Linkup. When enabled, recent web sources are retrieved using your latest user message as the query and passed to the model as additional context. Adds $0.013 per call, on top of the model’s normal token cost, whenever a search is invoked.
`disable_formatting`	boolean	no	false	When enabled, the gateway will not append the “Sources” footer to assistant responses that used Linkup web search. Useful when the model output is piped to another system that expects no decoration.
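The ranges in the table lend themselves to a small client-side check before a request is sent. The following is a minimal sketch, assuming the documented ranges are inclusive at both ends; `validate_params` is a hypothetical helper, not part of the API, and it checks only the numeric ranges and the `reasoning_effort` values listed above.

```python
# Ranges copied from the parameter table: name -> (minimum, maximum).
# ASSUMPTION: both bounds are inclusive.
NUMERIC_RANGES = {
    "max_tokens": (1, 128000),
    "temperature": (0, 2),
    "top_p": (0, 1),
    "top_k": (1, 100),
    "repetition_penalty": (0.1, 2),
    "thinking_budget": (1, 38912),
}
REASONING_EFFORTS = {"none", "low", "medium", "high", "max"}


def validate_params(params):
    """Return a list of problems found in params; an empty list means none."""
    problems = []
    for name, (low, high) in NUMERIC_RANGES.items():
        if name not in params:
            continue
        value = params[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} must be a number")
        elif not low <= value <= high:
            problems.append(
                f"{name}={value} is outside the documented range {low} to {high}"
            )

    effort = params.get("reasoning_effort")
    if effort is not None and effort not in REASONING_EFFORTS:
        problems.append(f"reasoning_effort={effort!r} is not an allowed value")
    return problems
```

For instance, `validate_params({"top_k": 0})` reports one problem because `top_k` has a documented minimum of 1. The server remains the authority on what it accepts; a check like this only catches obvious mistakes early.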

Machine-readable schema: GET https://api.empiriolabs.ai/v1/models/glm-5-1.
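To show how several of the parameters above fit together, here is a hedged sketch of a request body for schema-constrained output. It sets `enable_thinking` to `false`, as the table recommends for strict structured output. The exact shape of `response_format` is not documented on this page; the OpenAI-style `json_schema` envelope used here is an assumption, and `build_structured_body` and the example schema are hypothetical.

```python
import json


def build_structured_body(prompt, schema, schema_name="result"):
    """Build a chat completion body that requests JSON matching a schema."""
    return {
        "model": "glm-5-1",
        "messages": [{"role": "user", "content": prompt}],
        # The parameter table suggests disabling thinking for strict
        # structured output.
        "enable_thinking": False,
        # ASSUMPTION: OpenAI-style envelope; this page states only that
        # JSON mode and JSON schema are supported.
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": schema_name, "schema": schema},
        },
    }


# Hypothetical schema used purely for illustration.
CITY_SCHEMA = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

body = build_structured_body("Give the largest city in France.", CITY_SCHEMA)
payload = json.dumps(body)  # ready to use as the POST body
```

If the gateway expects a different envelope, only the `response_format` value needs to change; the machine-readable schema endpoint above is the place to confirm the accepted shape.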