FuguReport

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

Authors Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan, Didi Zhu, Changrui Chen, Xiuwei Zhao, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Kaichen Zhang, Wenkang Zhang, Zheng Cheng, Nansen Zhang, Chunsheng Wu, Chunjiang Ge, Zimin Ran, Dehua Song, Chunyuan Li, Shikun Feng, Ming Hu, Zhangquan Chen, Junbo Niu, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng
Affiliations AIM for Health Lab / Glint Lab / MVP Lab
Categories Evaluation / Multimodal Benchmarking / Performance on diverse multimodal benchmarks; Application / Perceptual Intelligence / Unified recognition across temporal and spatial modalities; Method / Multimodal Fusion / Integrated video and language reasoning
License CC BY 4.0

Abstract Overview

LLaVA-OneVision-2 is an 8B-class multimodal model aimed at unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-related reasoning. Its central design is codec-stream tokenization, which uses compressed-video bit-cost dynamics and motion-residual cues to adaptively allocate visual tokens over time and space, rather than relying only on uniformly sampled frames. The model is trained with a four-stage recipe that combines inherited image-text and instruction data with about 8 million captioned video samples and a 4 million-sample spatial corpus. The paper also introduces JumpScore, a benchmark for fine-grained temporal localization in dense repetitive motion, to evaluate capabilities that are underrepresented in existing video benchmarks.

Novelty

The paper’s main novelty is a codec-aligned input pipeline for multimodal modeling: compressed video is treated as a continuous bit-cost stream, with adaptive temporal grouping and saliency-driven spatial token selection packed into compact canvases. It also contributes JumpScore as a new benchmark focused on cycle-level temporal grounding in high-frequency repeated motion.
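The adaptive temporal grouping idea can be illustrated with a small sketch. It is not the paper's implementation: the function name `group_by_bit_cost`, the equal-cost splitting rule, and the per-frame `bit_costs` input are all assumptions made here for illustration. The sketch splits frames into contiguous groups so that each group carries roughly the same share of the total bit cost, which means busy segments get short groups and static segments get long ones.

```python
from typing import List


def group_by_bit_cost(bit_costs: List[float], num_groups: int) -> List[List[int]]:
    """Split frame indices into at most `num_groups` contiguous groups
    whose cumulative bit cost is roughly equal (hypothetical rule)."""
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    if not bit_costs:
        return []

    total = float(sum(bit_costs))
    if total <= 0:
        # Degenerate stream: fall back to treating every frame as equally costly.
        bit_costs = [1.0] * len(bit_costs)
        total = float(len(bit_costs))

    target = total / num_groups  # bit-cost share each group should carry
    groups: List[List[int]] = []
    current: List[int] = []
    acc = 0.0

    for idx, cost in enumerate(bit_costs):
        current.append(idx)
        acc += cost
        # Close the group once the running cost passes the next boundary,
        # but always leave the last group open to absorb the remainder.
        if acc >= target * (len(groups) + 1) and len(groups) < num_groups - 1:
            groups.append(current)
            current = []

    if current:
        groups.append(current)
    return groups


# Hypothetical per-frame bit costs: two expensive (high-motion) frames
# surrounded by cheap, near-static frames.
costs = [1, 1, 1, 1, 8, 8, 1, 1, 1, 1]
print(group_by_bit_cost(costs, num_groups=3))
```

In this toy stream, groups close quickly around the expensive frames while runs of cheap frames are merged, which is the qualitative behavior that distinguishes cost-driven grouping from uniform frame sampling.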

Results

On JumpScore, LLaVA-OneVision-2-8B achieves 74.9 mAP, compared with 30.1 for Qwen3-VL-8B. Under matched visual-token budgets, codec-stream inputs improve temporal grounding over frame sampling by 9.7 points on average. The paper reports average gains over Qwen3-VL-8B of 4.3 points across 18 video tasks, 5.3 points across 11 spatial benchmarks, and 15.6 J&F points across 4 tracking tasks. The model remains competitive on image and document benchmarks, though the paper notes it is not specialized for OCR- or document-heavy tasks.
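For readers unfamiliar with how temporal localization is usually scored, the sketch below computes temporal intersection-over-union (tIoU) between a predicted and a ground-truth segment, the quantity that mAP-style temporal grounding metrics are commonly built on. This summary does not describe JumpScore's exact matching protocol or thresholds, so the function `temporal_iou` and the example segments are illustrative assumptions rather than the benchmark's definition.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical example: one predicted motion cycle vs. its annotated cycle.
print(temporal_iou((0.0, 2.0), (1.0, 3.0)))
```

A prediction typically counts as correct when its tIoU with an unmatched ground-truth segment exceeds a chosen threshold, and average precision is then computed over the ranked predictions.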

Key Points

  1. Codec-stream tokenization adaptively assigns visual evidence using compressed-video bit-cost and motion-residual signals, enabling more stable long-video compression than fixed group-of-pictures (GOP) grouping or uniform frame sampling.
  2. The training recipe combines large-scale open supervision, including roughly 8 million re-captioned video samples and a 4 million-sample 2D/3D spatial corpus, with codec-stream training introduced specifically in the final long-video stage.
  3. Empirically, the strongest reported gains are in temporal grounding, spatial reasoning, and tracking, especially on the newly introduced JumpScore benchmark for repeated-motion localization.
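
The saliency-driven spatial selection mentioned in the first key point can be sketched as a top-k filter over patch scores. This is a minimal illustration under stated assumptions: the function `select_salient_patches`, the use of a plain 2D grid of scores (standing in for motion-residual magnitude per patch), and the fixed `budget` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def select_salient_patches(
    saliency: List[List[float]], budget: int
) -> List[Tuple[int, int]]:
    """Keep the `budget` highest-scoring patches of a 2D saliency grid.

    Returns (row, col) positions in raster order so the kept patches
    can be packed into a compact canvas in a stable layout.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")

    scored = [
        (score, r, c)
        for r, row in enumerate(saliency)
        for c, score in enumerate(row)
    ]
    # Highest score first; ties broken by raster position for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    kept = [(r, c) for _, r, c in scored[:budget]]
    return sorted(kept)


# Hypothetical 2x3 grid of per-patch motion-residual scores.
grid = [
    [0.1, 0.9, 0.0],
    [0.5, 0.2, 0.8],
]
print(select_salient_patches(grid, budget=3))
```

Sorting the kept positions back into raster order is a design choice for this sketch: it keeps spatial neighbors adjacent after low-saliency patches are dropped.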
