Alibaba Cloud

Models from Alibaba Cloud

HappyHorse 1.0
Video model offering Text-to-Video, Image-to-Video, Reference-to-Video, and Video Edit modes with high-fidelity, motion-smooth output.
Qwen Image 2.0
Unified image generation and editing model with class-leading complex Chinese/English text rendering, realistic textures, and multi-image fusion.
Qwen3 Max
256K-context flagship with major improvements in reasoning, instruction following, and multilingual support, plus higher coding/math accuracy.
Qwen3 Max Preview
Preview release with major gains over the 2.5 series in Chinese-English understanding, complex instructions, multilingual ability, and tool use.
Qwen3 Max Thinking
Reasoning model with adaptive tool use (search, memory, code interpreter) and test-time scaling for higher accuracy on complex tasks.
Qwen3 Rerank
Semantic document reranker. Sorts up to 500 candidates per query by relevance, supports 100+ languages, and accepts a custom sorting instruction.
Qwen3.5 122B-A10B
Qwen3.5 122B-A10B is a multimodal reasoning model with 256K context, efficient sparse MoE inference, and text, image, and video input.
Qwen3.5 27B
Qwen3.5 27B is a dense multimodal reasoning model with fast responses, 256K context, and text, image, and video understanding.
Qwen3.5 35B-A3B
Qwen3.5 35B-A3B is an efficient native vision-language model with sparse MoE routing, deep thinking, and text, image, and video input.
Qwen3.5 397B-A17B
Qwen3.5 397B-A17B is a flagship multimodal reasoning model for language, code, agents, GUI tasks, and image and video understanding.
Qwen3.5 4B
Qwen3.5 4B is a low-cost multimodal reasoning model with 256K context, image and video input, function tools, and structured output.
Qwen3.5 9B
Qwen3.5 9B is a compact multimodal reasoning model with 256K context, image and video input, function tools, and structured output.
Qwen3.5 Flash
Vision-language model with a hybrid linear-attention and sparse MoE architecture, 1M context, and fast multimodal inference over text, image, and video.
Qwen3.5 Omni Flash
Cost-efficient omni-modal model handling text, image, audio, and video, with up to 3 hours of audio and 1 hour of video across 90+ languages.
Qwen3.5 Omni Plus
Flagship omni-modal model for text, image, audio, and video. Handles up to 3 hours of audio and 1 hour of video, with 90+ input languages, 30+ output languages, and 55 voice timbres.
Qwen3.5 Plus
Multimodal model with a hybrid architecture for efficient deep thinking and visual understanding across text, image, and video, with a 1M context.
Qwen3.6 27B
Qwen3.6 27B improves agentic coding, STEM reasoning, spatial vision, OCR, and text, image, and video understanding on 256K context.
Qwen3.6 Flash
Fast Qwen3.6 vision-language model for agentic coding, math reasoning, spatial understanding, OCR, and text, image, and video input.
Qwen3.6 Max Preview
Largest preview variant in the 3.6 series (text-only): improved coding agent execution, stronger front-end skills, and broader long-tail knowledge.
Qwen3.6 Plus
Vision-language model with major upgrades over 3.5: agentic and front-end coding, multimodal recognition, OCR, and object localization.
Qwen3.7 Max
Qwen3.7 Max is a flagship text model for coding, productivity, long-running agents, deep thinking, tools, and 1M-token context.
Qwen3.7 Plus
Cost-effective Qwen3.7 vision-language model for text, image, video, coding, tool use, GUI understanding, and 1M-context workflows.
Text Embedding v4
Multilingual text embedding with selectable output dimensions (64–2048). Up to 8,192 tokens per input.
Tongyi Embedding Vision Flash
Speed-optimised multimodal embedding with the same shape as Tongyi Embedding Vision Plus and 3× cheaper image/video tokens.
Tongyi Embedding Vision Plus
Multimodal embedding producing independent vectors for text, image, and video inputs.
Wan 2.6
Multimodal video generation model for cinematic, multi-shot stories with native audio-visual sync (lip-sync, dialogue, music, SFX).
Wan 2.7
Multimodal video model supporting text-to-video (T2V), image-to-video (I2V), video editing, and reference-to-video, with high-fidelity output from text, image, or video inputs.
Wan2.7 Image
Image generation and editing companion model: text-to-image, bounding-box edits, and cohesive image sets, with up to 4K output on Pro.