Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

https://news.ycombinator.com/rss Hits: 19

Summary

A few months after launching Qwen3-VL, Alibaba has released a detailed technical report on the open multimodal model. The data shows the system excels at image-based math tasks and can analyze hours of video footage. The system handles massive data loads, processing two-hour videos or hundreds of document pages within a 256,000-token context window. In "needle-in-a-haystack" tests, the flagship 235-billion-parameter model located individual frames in 30-minute videos with 100 percent accuracy. Even in two-hour videos containing roughly one million tokens, accuracy held at 99.5 percent. The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze. The needle-in-a-haystack test measures the model's ability to locate specific frames in long videos. | Image: Alibaba Share Recommend our article Share In published benchmarks, the Qwen3-VL-235B-A22B model often beats Gemini 2.5 Pro, OpenAI GPT-5, and Claude Opus 4.1 - even when competitors use reasoning features or high thinking budgets. The model dominates visual math tasks, scoring 85.8 percent on MathVista compared to GPT-5's 81.3 percent. On MathVision, it leads with 74.6 percent, ahead of Gemini 2.5 Pro (73.3 percent) and GPT-5 (65.8 percent).Ad THE DECODER Newsletter The most important AI news straight to your inbox. ✓ Weekly ✓ Free ✓ Cancel at any time Gemini's older 2.5 Pro model maintains a slight lead in general image understanding. | Image: Alibaba The model also shows range in specialized benchmarks. It scored 96.5 percent on the DocVQA document comprehension test and 875 points on OCRBench, supporting 39 languages - nearly four times as many as its predecessor. Qwen3-VL achieves over 70 percent accuracy on OCR tasks in 32 of the 39 supported languages. | Image: Alibaba Alibaba claims the system demonstrates new capabilities in GUI agent tasks. It achieved 61.8 percent accuracy on ScreenSpot Pro, which tests navigation in gra...

First seen: 2025-12-02 21:55

Last seen: 2025-12-03 15:57

Read Full Article More from this Source

Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

Summary

Related News

SSE sucks for transporting LLM tokens

Hacking Google Chrome Source Code: Make Puppeteer work over Redis PubSub

Photographer Built a Medium-Format Rangefinder, and So Can You

Fast, Memory-Efficient Hash Table in Java: Borrowing the Best Ideas

Computer Animator and Amiga fanatic Dick Van Dyke turns 100