Alibaba Cloud

Models from Alibaba Cloud

HappyHorse 1.0
Video model offering Text-to-Video, Image-to-Video, Reference-to-Video, and Video Edit modes with high-fidelity output and smooth motion.

Qwen Image 2.0
Unified image generation and editing model with class-leading complex Chinese/English text rendering, realistic textures, and multi-image fusion.

Qwen3 Max
256K-context flagship with major improvements in reasoning, instruction following, and multilingual support, plus higher coding/math accuracy.

Qwen3 Max Preview
Preview release with major gains over the Qwen2.5 series in Chinese-English understanding, complex instruction following, multilingual ability, and tool use.

Qwen3 Max Thinking
Reasoning model with adaptive tool use (search, memory, code interpreter) and test-time scaling for higher accuracy on complex tasks.

Qwen3 Rerank
Semantic document reranker. Sorts up to 500 candidates per query by relevance, supports 100+ languages, and accepts a custom sorting instruction.

Qwen3.5 122B-A10B
Qwen3.5 122B-A10B is a multimodal reasoning model with 256K context, efficient sparse MoE inference, and text, image, and video input.

Qwen3.5 27B
Qwen3.5 27B is a dense multimodal reasoning model with fast responses, 256K context, and text, image, and video understanding.

Qwen3.5 35B-A3B
Qwen3.5 35B-A3B is an efficient native vision-language model with sparse MoE routing, deep thinking, and text, image, and video input.

Qwen3.5 397B-A17B
Qwen3.5 397B-A17B is a flagship multimodal reasoning model for language, code, agents, GUI tasks, and image and video understanding.

Qwen3.5 4B
Qwen3.5 4B is a low-cost multimodal reasoning model with 256K context, image and video input, function tools, and structured output.

Qwen3.5 9B
Qwen3.5 9B is a compact multimodal reasoning model with 256K context, image and video input, function tools, and structured output.

Qwen3.5 Flash
Vision-language model with a hybrid linear-attention and sparse MoE architecture, 1M context, and fast multimodal inference over text, image, and video.

Qwen3.5 Omni Flash
Cost-efficient omni-modal model handling text, image, audio, and video, with up to 3 hours of audio and 1 hour of video across 90+ languages.

Qwen3.5 Omni Plus
Flagship omni-modal model for text, image, audio, and video. Handles up to 3 hours of audio and 1 hour of video, with 90+ input languages, 30+ output languages, and 55 voice timbres.

Qwen3.5 Plus
Multimodal model with a hybrid architecture for efficient deep thinking and visual understanding across text, image, and video, with a 1M context.

Qwen3.6 27B
Qwen3.6 27B improves agentic coding, STEM reasoning, spatial vision, OCR, and text, image, and video understanding, with a 256K context.

Qwen3.6 Flash
Fast Qwen3.6 vision-language model for agentic coding, math reasoning, spatial understanding, OCR, and text, image, and video input.

Qwen3.6 Max Preview
Largest preview variant in the Qwen3.6 series (text-only), with improved coding agent execution, stronger front-end skills, and broader long-tail knowledge.

Qwen3.6 Plus
Vision-language model with major upgrades over Qwen3.5 in agentic and front-end coding, multimodal recognition, OCR, and object localization.

Qwen3.7 Max
Qwen3.7 Max is a flagship text model for coding, productivity, long-running agents, deep thinking, tools, and 1M-token context.

Qwen3.7 Plus
Cost-effective Qwen3.7 vision-language model for text, image, video, coding, tool use, GUI understanding, and 1M-context workflows.

Text Embedding v4
Multilingual text embedding with selectable output dimensions (64–2048). Up to 8,192 tokens per input.

Tongyi Embedding Vision Flash
Speed-optimised multimodal embedding with the same shape as Tongyi Embedding Vision Plus and 3× cheaper image/video tokens.

Tongyi Embedding Vision Plus
Multimodal embedding producing independent vectors for text, image, and video inputs.

Wan 2.6
Multimodal video generation model for cinematic, multi-shot stories with native audio-visual sync (lip-sync, dialogue, music, SFX).

Wan 2.7
Multimodal video model supporting text-to-video (T2V), image-to-video (I2V), video editing, and reference-to-video, with high-fidelity output from text, image, or video inputs.

Wan2.7 Image
Image generation and editing companion model: text-to-image, bounding-box edits, and cohesive image sets, with up to 4K output on Pro.
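
Two entries above quote hard request limits: Qwen3 Rerank sorts up to 500 candidates per query, and Text Embedding v4 offers output dimensions in the 64–2048 range. The following is a minimal client-side sketch of how a caller might respect those limits before sending a request. The function names and constants are hypothetical and not part of any Alibaba Cloud SDK, and the service may accept only specific dimension values inside that range, so treat the range check as a first-pass guard only.

```python
# Limits as quoted in the catalog entries; names here are illustrative only.
RERANK_MAX_CANDIDATES = 500
EMBED_MIN_DIM, EMBED_MAX_DIM = 64, 2048


def split_candidates(candidates, batch_size=RERANK_MAX_CANDIDATES):
    """Split a candidate list into batches no larger than batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        candidates[i:i + batch_size]
        for i in range(0, len(candidates), batch_size)
    ]


def check_embedding_dimension(dim):
    """Return dim if it lies in the quoted 64-2048 range, else raise ValueError."""
    if not EMBED_MIN_DIM <= dim <= EMBED_MAX_DIM:
        raise ValueError(
            f"dimension {dim} is outside {EMBED_MIN_DIM}-{EMBED_MAX_DIM}"
        )
    return dim


if __name__ == "__main__":
    docs = [f"doc-{n}" for n in range(1200)]
    batches = split_candidates(docs)
    print([len(b) for b in batches])
    print(check_embedding_dimension(1024))
```

Relevance scores from separate rerank calls may not be directly comparable, so batching is a way to stay under the per-query limit and does not replace a single ranked pass.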
