Alibaba Cloud

Models from Alibaba Cloud

Video model offering Text-to-Video, Image-to-Video, Reference-to-Video, and Video Edit modes with high-fidelity, motion-smooth output.

HappyHorse 1.1

Text-to-video, image-to-video, and reference-to-video in one model. Cinematic motion, character consistency across up to 9 references, and synchronized native audio.

Qwen Audio 3.0 TTS

Tiered speech synthesis with over 1,000 voices, 16 languages, 20 Chinese dialects, natural-language delivery direction, and inline emotion tags.

Qwen Image 2.0

Unified image generation and editing model with class-leading complex Chinese/English text rendering, realistic textures, and multi-image fusion.

Qwen3 Max

256K-context flagship with major improvements in reasoning, instruction following, and multilingual support, plus higher coding/math accuracy.

Qwen3 Max Preview

Preview release with major gains over the 2.5 series in Chinese-English understanding, complex instructions, multilingual ability, and tool use.

Qwen3 Max Thinking

Reasoning model with adaptive tool use (search, memory, code interpreter) and test-time scaling for higher accuracy on complex tasks.

Qwen3 Rerank

Semantic document reranker. Sorts up to 500 candidates per query by relevance, supports 100+ languages, and accepts a custom sorting instruction.
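
As a rough illustration of working within the 500-candidate limit described above, the sketch below splits a larger candidate list into batches and attaches the optional sorting instruction. The function name, payload keys (`query`, `documents`, `instruction`, `offset`), and batching strategy are assumptions made for this example, not the service's actual request format. Relevance scores from separate calls may not be directly comparable, so how to merge batch results is left open.

```python
MAX_CANDIDATES_PER_QUERY = 500  # limit stated in the description above


def build_rerank_batches(query, documents, instruction=None,
                         batch_size=MAX_CANDIDATES_PER_QUERY):
    """Split documents into payloads of at most 500 candidates each.

    The payload keys are hypothetical placeholders, not a real API schema.
    """
    if not 1 <= batch_size <= MAX_CANDIDATES_PER_QUERY:
        raise ValueError(
            f"batch_size must be between 1 and {MAX_CANDIDATES_PER_QUERY}"
        )

    batches = []
    for start in range(0, len(documents), batch_size):
        payload = {
            "query": query,
            "documents": documents[start:start + batch_size],
            # Index of the first document in this batch, so results can be
            # mapped back to positions in the original list.
            "offset": start,
        }
        if instruction is not None:
            # Custom sorting instruction, included only when provided.
            payload["instruction"] = instruction
        batches.append(payload)
    return batches
```

Keeping the offset in each payload lets a caller map per-batch rankings back to positions in the original candidate list.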

Qwen3.5 122B-A10B

Qwen3.5 122B-A10B is a multimodal reasoning model with 256K context, efficient sparse MoE inference, and text, image, and video input.

Qwen3.5 27B

Qwen3.5 27B is a dense multimodal reasoning model with fast responses, 256K context, and text, image, and video understanding.

Qwen3.5 35B-A3B

Qwen3.5 35B-A3B is an efficient native vision-language model with sparse MoE routing, deep thinking, and text, image, and video input.

Qwen3.5 397B-A17B

Qwen3.5 397B-A17B is a flagship multimodal reasoning model for language, code, agents, GUI tasks, and image and video understanding.

Qwen3.5 4B

Qwen3.5 4B is a low-cost multimodal reasoning model with 256K context, image and video input, function tools, and structured output.

Qwen3.5 9B

Qwen3.5 9B is a compact multimodal reasoning model with 256K context, image and video input, function tools, and structured output.

Qwen3.5 Flash

Vision-language model with hybrid linear-attention plus sparse MoE, 1M context, and fast multimodal text/image/video inference.

Qwen3.5 Omni Flash

Cost-efficient omni-modal model handling text, image, audio, and video, with up to 3 hours of audio and 1 hour of video across 90+ languages.

Qwen3.5 Omni Plus

Flagship omni-modal model for text, image, audio, and video. Handles up to 3 hours of audio and 1 hour of video, with 90+ input languages, 30+ output languages, and 55 voice timbres.

Qwen3.5 Plus

Multimodal model with hybrid architecture for efficient deep thinking and visual understanding across text, image, and video on a 1M context.

Qwen3.6 27B

Qwen3.6 27B improves agentic coding, STEM reasoning, spatial vision, OCR, and text, image, and video understanding, with a 256K context.

Qwen3.6 35B A3B

Qwen3.6 35B A3B is a 256-expert mixture-of-experts reasoning model with 128K context, function tools, and strict structured JSON output.

Qwen3.6 Flash

Fast Qwen3.6 vision-language model for agentic coding, math reasoning, spatial understanding, OCR, and text, image, and video input.

Qwen3.6 Max Preview

Largest preview variant in the 3.6 series (text-only): improved coding agent execution, stronger front-end skills, and broader long-tail knowledge.

Qwen3.6 Plus

Vision-language model with major upgrades over 3.5: agentic and front-end coding, multimodal recognition, OCR, and object localization.

Qwen3.7 Flash

Fast Qwen3.7 vision-language model for text, image, video, tool use, and agentic tasks, with implicit caching and a 1M-token context.

Qwen3.7 Max

Qwen3.7 Max is a flagship text model for coding, productivity, long-running agents, deep thinking, tools, and 1M-token context.

Qwen3.7 Plus

Cost-effective Qwen3.7 vision-language model for text, image, video, coding, tool use, GUI understanding, and 1M-context workflows.

Text Embedding v4

Multilingual text embedding with selectable output dimensions (64–2048). Up to 8,192 tokens per input.
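
To make the two limits above concrete, here is a minimal sketch of a client-side check run before sending a request. The `EmbeddingRequest` class, its field names, and the `validate_request` helper are hypothetical and exist only for this example. The description gives a range of 64 to 2048 without saying which values inside it are accepted, so the check only enforces the range. Token counts are assumed to come from whatever tokenizer the caller uses.

```python
from dataclasses import dataclass
from typing import List

MIN_DIMENSIONS = 64          # lower bound stated in the description above
MAX_DIMENSIONS = 2048        # upper bound stated in the description above
MAX_TOKENS_PER_INPUT = 8192  # per-input token limit stated above


@dataclass(frozen=True)
class EmbeddingRequest:
    """Hypothetical request shape; field names are illustrative only."""
    texts: List[str]
    dimensions: int


def validate_request(request: EmbeddingRequest,
                     token_counts: List[int]) -> List[str]:
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    if not MIN_DIMENSIONS <= request.dimensions <= MAX_DIMENSIONS:
        problems.append(
            f"dimensions {request.dimensions} is outside "
            f"{MIN_DIMENSIONS}-{MAX_DIMENSIONS}"
        )
    if len(token_counts) != len(request.texts):
        problems.append("token_counts must have one entry per input text")
    for index, count in enumerate(token_counts):
        if count > MAX_TOKENS_PER_INPUT:
            problems.append(
                f"input {index} has {count} tokens, "
                f"limit is {MAX_TOKENS_PER_INPUT}"
            )
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every violation in a single pass.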

Tongyi Embedding Vision Flash

Speed-optimised multimodal embedding with the same shape as Tongyi Embedding Vision Plus and 3× cheaper image/video tokens.

Tongyi Embedding Vision Plus

Multimodal embedding producing independent vectors for text, image, and video inputs.

Wan 2.6

Multimodal video generation model for cinematic, multi-shot stories with native audio-visual sync (lip-sync, dialogue, music, SFX).

Wan 2.7

Multimodal video model supporting T2V, I2V, video editing, and reference-to-video, with high-fidelity output from text, image, or video inputs.

Wan2.7 Image

Image generation and editing companion model: text-to-image, bounding-box edits, and cohesive image sets, with up to 4K output on Pro.