Alibaba Cloud, the cloud computing division of Chinese technology giant Alibaba, introduced its latest vision-language model, Qwen2.5-VL, on Wednesday. The new release marks a substantial advance over its predecessor, Qwen2-VL.
The open-source multimodal model comes in three sizes, with 3 billion, 7 billion, and 72 billion parameters, each offered in both base and instruction-tuned versions.
In a statement posted on its official WeChat account, Alibaba's cloud division also said that its Qwen2.5-Max model "surpasses GPT-4o, DeepSeek-V3, and Llama-3.1-405B on almost all metrics," referring to leading models from OpenAI, DeepSeek, and Meta.
The flagship model, Qwen2.5-VL-72B-Instruct, is accessible through the Qwen Chat platform, while the entire Qwen2.5-VL series is available on Hugging Face and Alibaba's open-source community, ModelScope.
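Since the weights are public, a minimal usage sketch might look like the following. It assumes the Hugging Face transformers integration and the separate qwen_vl_utils helper package referenced on the model card; the 7B checkpoint, image URL, and question are illustrative placeholders rather than part of Alibaba's announcement.

```python
# Illustrative sketch: load the 7B instruct checkpoint from Hugging Face and
# ask a question about an image. Requires a transformers version with
# Qwen2.5-VL support plus the qwen_vl_utils helper; URL and prompt are placeholders.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Chat-style message mixing an image with a text question about its contents.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/sales_chart.png"},  # placeholder
            {"type": "text", "text": "What trend does this chart show?"},
        ],
    }
]

# Render the chat template, extract the vision inputs, and build model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate an answer, then strip the prompt tokens before decoding.
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

The same pattern applies to the 3 billion and 72 billion parameter checkpoints, with hardware requirements scaling accordingly.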
Alibaba says Qwen2.5-VL offers strong multimodal capabilities, including advanced visual analysis of text, charts, diagrams, and graphics within images. The model can also understand videos longer than an hour, answer questions about their content, and pinpoint specific segments down to the second.