Alibaba report shows its Qwen3-VL multimodal AI suite outpacing top commercial models

A few months after launching its Qwen3-VL multimodal AI suite, Alibaba has released a detailed technical report showing the system outperforming leading commercial models, including OpenAI’s GPT-5, Google’s Gemini 2.5 Pro and Anthropic’s Claude Opus 4.1, on a range of visual-math, document-comprehension and long-video analysis benchmarks. The findings highlight significant advances in open-source AI, particularly in tasks involving large volumes of visual and text data.

Excels at Math, Documents and Long-Video Tasks

According to Alibaba, the flagship Qwen3-VL-235B-A22B model demonstrates exceptional performance on image-based mathematics, scoring 85.8% on MathVista, compared with 81.3% for GPT-5. On MathVision, the model leads with 74.6%, ahead of Gemini 2.5 Pro and GPT-5.

The system shows similar gains in document intelligence. It reached 96.5% on the DocVQA test, handled OCR tasks in 39 languages (nearly quadruple its predecessor's multilingual coverage) and scored 875 points on OCRBench.

Qwen3-VL’s ability to process extremely long inputs also sets it apart. The model handles two-hour videos, hundreds of document pages and other large data sets within a 262,144-token (256K) context window, one of the longest available among open models.

In “needle-in-a-haystack” video tests — where a single meaningful frame is hidden inside long footage — the 235B model identified frames in 30-minute videos with 100% accuracy, and in two-hour videos with 99.5% accuracy.

Beats GPT-5, Gemini in Many Public Benchmarks

The new report shows Qwen3-VL outperforming top commercial systems on several high-profile benchmarks, even when competitors use extended reasoning modes or higher “thinking budgets.”

Notable results include:

  • MathVista: 85.8% (Qwen3-VL) vs. 81.3% (GPT-5)
  • MathVision: 74.6% (Qwen3-VL) vs. 73.3% (Gemini 2.5 Pro)
  • DocVQA: 96.5%
  • CharXiv (scientific charts): 90.5% on description tasks
  • MMLongBench-Doc: 56.2% on long-document comprehension

The model also demonstrates strong performance in GUI-based tasks. Qwen3-VL-235B-A22B scored 61.8% on ScreenSpot Pro, which evaluates navigation in graphical interfaces. The mid-sized Qwen3-VL-32B model reached 63.7% on AndroidWorld, a benchmark requiring autonomous operation of Android apps.

Still Trails Rivals in Some General Reasoning Tests

Despite broad strengths, Qwen3-VL is not dominant across all categories. In the complex MMMU-Pro benchmark, the model scored 69.3%, trailing GPT-5’s 78.4%. Alibaba researchers also acknowledged that commercial systems tend to hold an edge in video question-answering tasks requiring deeper general reasoning.

Analysts say the results suggest Qwen3-VL is emerging as a specialist in visual math, documents and long-video comprehension, while commercial models continue to lead in higher-level reasoning.

Architectural Upgrades Behind the Gains

Alibaba credits the improvements to three major changes in model design:

Interleaved MRoPE

The older grouped positional encoding method has been replaced by “interleaved MRoPE,” which distributes positional information evenly across dimensions. This upgrade is designed to improve stability and accuracy on long-video tasks.
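The difference between the two layouts can be sketched in a few lines. This is an illustrative toy, not the report's implementation: the function names and the three-axis (time/height/width) split are assumptions; the point is only how "grouped" versus "interleaved" assignment distributes rotary-frequency dimensions across axes.

```python
# Toy sketch of MRoPE dimension assignment. In rotary position embeddings,
# each pair of hidden dimensions carries one frequency, from slow to fast.
# How those pairs are divided among the temporal (t), height (h) and
# width (w) axes is the design choice described above.

def grouped_axes(num_pairs: int) -> list:
    """Older 'grouped' layout: contiguous blocks of frequency pairs per
    axis, so one axis monopolizes all the low (slowest) frequencies."""
    per_axis = num_pairs // 3
    return ["t"] * per_axis + ["h"] * per_axis + ["w"] * (num_pairs - 2 * per_axis)

def interleaved_axes(num_pairs: int) -> list:
    """Interleaved layout: axes take turns t, h, w, t, h, w, ... so every
    axis spans the full frequency spectrum, which is what the report
    credits with more stable long-video behavior."""
    return [("t", "h", "w")[i % 3] for i in range(num_pairs)]

print(grouped_axes(6))      # ['t', 't', 'h', 'h', 'w', 'w']
print(interleaved_axes(6))  # ['t', 'h', 'w', 't', 'h', 'w']
```

In the grouped layout the temporal axis only ever sees the slowest frequencies; interleaving gives each axis both coarse and fine positional resolution.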

DeepStack Access to Intermediate Vision Features

The model can now pull intermediate results from the vision encoder rather than only the final output, giving it finer-grained visual understanding across multiple layers of detail.
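A minimal sketch of that idea, with heavy assumptions: the "encoder layers" below are stand-in arithmetic, and the choice of tapped layers and additive injection are illustrative guesses rather than the report's exact design. What it shows is the shape of the technique: record hidden states at intermediate encoder layers, then feed them into the language model alongside the final output.

```python
# DeepStack-style sketch: tap intermediate vision-encoder features and
# inject them into the language model's stream, instead of passing only
# the encoder's final output.

def encode_with_taps(patch_embeds, num_layers=4, tap_layers=(1, 2, 3)):
    """Run a toy 'encoder' (each layer just adds 1) and record hidden
    states at the tapped intermediate layers."""
    hidden = list(patch_embeds)
    taps = {}
    for layer in range(num_layers):
        hidden = [h + 1 for h in hidden]      # stand-in for a transformer block
        if layer in tap_layers:
            taps[layer] = list(hidden)
    return hidden, taps

def inject_taps(llm_hidden, taps):
    """Add each tapped feature map into the LLM stream, giving the model
    access to multiple granularities of visual detail."""
    out = list(llm_hidden)
    for feats in taps.values():
        out = [o + f for o, f in zip(out, feats)]
    return out

final, taps = encode_with_taps([0.0, 0.0])
print(final)                          # [4.0, 4.0]  (final-layer output)
print(inject_taps([0.0, 0.0], taps))  # [9.0, 9.0]  (sum of tapped layers)
```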

Text-Based Timestamp System

A new text-based time-marker approach replaces the complex T-RoPE system. Instead of mathematically encoding every frame, the model receives simple markers such as “<3.8 seconds>”, improving its ability to reason about temporal sequences in videos.
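The scheme above is simple enough to sketch directly. The "<N seconds>" marker format follows the article's example, but the helper name, the placeholder frame tokens and the sampling rate are assumptions; the real prompt template may differ.

```python
# Sketch of text-based timestamps: rather than mathematically encoding
# each frame's time, a plain-text marker is interleaved before each
# frame's tokens in the input sequence.

def interleave_timestamps(frame_tokens, fps=0.5):
    """Prefix each frame placeholder with a textual time marker.
    `fps` is the assumed frame-sampling rate (frames per second)."""
    seq = []
    for i, frame in enumerate(frame_tokens):
        seconds = i / fps
        seq.append(f"<{seconds:g} seconds>")
        seq.append(frame)
    return seq

print(interleave_timestamps(["[frame0]", "[frame1]", "[frame2]"], fps=0.5))
# ['<0 seconds>', '[frame0]', '<2 seconds>', '[frame1]', '<4 seconds>', '[frame2]']
```

Because the timestamps are ordinary text tokens, the model can reason about them the same way it reasons about any other text, which is the appeal of dropping the specialized T-RoPE encoding.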

Training on One Trillion Tokens

Qwen3-VL was trained in four phases using up to 10,000 GPUs, incorporating:

  • One trillion multimodal tokens
  • Web-scraped images and text
  • Three million PDFs from Common Crawl
  • Over 60 million STEM tasks

The context window was expanded during training from 8,192 to 32,768 tokens, and ultimately to 262,144 (256K). “Thinking” variants of the model received explicit chain-of-thought supervision to improve reasoning clarity on complex tasks.

Fully Open Source Under Apache 2.0

All Qwen3-VL models — from 2B to 32B dense variants, to the 30B-A3B and the large 235B-A22B MoE version — are released with open weights under the Apache 2.0 license on Hugging Face.

While long-context video extraction is not new — Google’s Gemini 1.5 Pro demonstrated similar capabilities in early 2024 — Alibaba’s model delivers competitive results in an open package, positioning it to play a major role in academic and open-source research.

For more news and reports like this, visit the home page of The Gignomist.