LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
Abstract Overview
LLaVA-OneVision-2 is an 8B-class multimodal model aimed at unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-related reasoning. Its central design is codec-stream tokenization, which uses compressed-video bit-cost dynamics and motion-residual cues to adaptively allocate visual tokens over time and space, rather than relying only on uniformly sampled frames. The model is trained with a four-stage recipe that combines inherited image-text and instruction data with about 8 million captioned video samples and a 4 million-sample spatial corpus. The paper also introduces JumpScore, a benchmark for fine-grained temporal localization in dense repetitive motion, to evaluate capabilities that are underrepresented in existing video benchmarks.
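To make the idea of bit-cost-driven token allocation concrete, here is a minimal sketch that splits a fixed visual-token budget across frames in proportion to each frame's compressed size. It is not the paper's algorithm: the function name `allocate_tokens`, the per-frame `floor`, and the proportional rule with largest-remainder rounding are all assumptions chosen for illustration.

```python
from typing import List


def allocate_tokens(bit_costs: List[float], budget: int, floor: int = 1) -> List[int]:
    """Split `budget` tokens across frames in proportion to bit cost.

    Hypothetical helper: every frame gets at least `floor` tokens, and the
    rest is shared proportionally, using largest-remainder rounding so the
    allocation sums exactly to `budget`.
    """
    n = len(bit_costs)
    if n == 0:
        return []
    if budget < floor * n:
        raise ValueError("budget too small for the per-frame floor")

    remaining = budget - floor * n
    total = sum(bit_costs)
    if total <= 0:
        # No usable signal: fall back to a uniform split.
        shares = [remaining / n] * n
    else:
        shares = [remaining * c / total for c in bit_costs]

    alloc = [floor + int(s) for s in shares]
    leftover = budget - sum(alloc)
    # Hand the tokens lost to truncation to the largest fractional parts.
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Frames 0 and 3 are expensive to encode (more motion or new content),
# so they receive most of the 24-token budget.
example_alloc = allocate_tokens([100, 10, 10, 80], budget=24)
```

Under this rule, low-cost frames (little change from their neighbors) keep only a small token share, while uniform frame sampling would give every frame the same share regardless of content.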
Novelty
The paper’s main novelty is a codec-aligned input pipeline for multimodal modeling: compressed video is treated as a continuous bit-cost stream, with adaptive temporal grouping and saliency-driven spatial token selection packed into compact canvases. It also contributes JumpScore as a new benchmark focused on cycle-level temporal grounding in high-frequency repeated motion.
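The two selection steps named above, adaptive temporal grouping and saliency-driven spatial selection, can be sketched roughly as follows. This is an illustrative assumption, not the authors' pipeline: the threshold `cost_per_group`, the top-k rule, and both function names are hypothetical, and the canvas-packing step is left out.

```python
from typing import List, Sequence


def group_frames_by_bit_cost(bit_costs: Sequence[float], cost_per_group: float) -> List[List[int]]:
    """Close a temporal group once its accumulated bit cost reaches the threshold.

    Quiet stretches (cheap frames) are merged into long groups, while
    high-cost frames end up in short groups of their own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    acc = 0.0
    for idx, cost in enumerate(bit_costs):
        current.append(idx)
        acc += cost
        if acc >= cost_per_group:
            groups.append(current)
            current, acc = [], 0.0
    if current:
        # Trailing frames that never reached the threshold.
        groups.append(current)
    return groups


def select_salient_patches(saliency: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-saliency patches, in spatial order."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])


groups = group_frames_by_bit_cost([5, 5, 50, 2, 3, 40, 1], cost_per_group=10)
kept = select_salient_patches([0.1, 0.9, 0.3, 0.8], k=2)
```

In this sketch the saliency scores stand in for whatever motion-residual signal the codec exposes per patch; how that signal is actually derived is not specified here.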
Results
On JumpScore, LLaVA-OneVision-2-8B achieves 74.9 mAP, compared with 30.1 for Qwen3-VL-8B. Under matched visual-token budgets, codec-stream inputs improve temporal grounding over frame sampling by 9.7 points on average. The paper reports average gains over Qwen3-VL-8B of 4.3 points across 18 video tasks, 5.3 points across 11 spatial benchmarks, and 15.6 J&F points across 4 tracking tasks. The model remains competitive on image and document benchmarks, though the paper notes it is not specialized for OCR- or document-heavy tasks.
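For readers unfamiliar with how temporal localization is typically scored, the sketch below computes temporal IoU between intervals and a non-interpolated average precision at a single IoU threshold. JumpScore's exact protocol (thresholds, matching rule, averaging over classes or videos) is not described in this summary, so everything here, including the 0.5 threshold and the greedy one-to-one matching, is an assumption based on common practice.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
Prediction = Tuple[float, float, float]  # (start, end, confidence)


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: List[Prediction], gts: List[Interval], iou_thr: float = 0.5) -> float:
    """Non-interpolated AP: mean of precision at each true-positive rank.

    Predictions are visited in descending confidence; each ground-truth
    interval can be matched at most once.
    """
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    hits = 0
    precision_sum = 0.0
    ranked = sorted(preds, key=lambda p: p[2], reverse=True)
    for rank, (start, end, _score) in enumerate(ranked, start=1):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if matched[j]:
                continue
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[best_j] = True
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(gts)


# Two ground-truth cycles; the second prediction is a false positive.
ap = average_precision(
    preds=[(0.0, 1.0, 0.9), (5.0, 6.0, 0.8), (2.0, 3.0, 0.7)],
    gts=[(0.0, 1.0), (2.0, 3.0)],
)
```

A mean AP would then average this quantity over IoU thresholds, classes, or videos, depending on the benchmark's definition.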
Key Points
- Codec-stream tokenization adaptively assigns visual evidence using compressed-video bit-cost and motion-residual signals, which the paper presents as giving more stable long-video compression than fixed group-of-pictures (GOP) grouping or uniform frame sampling.
- The training recipe combines large-scale open supervision, including roughly 8 million re-captioned video samples and a 4 million-sample 2D/3D spatial corpus, with codec-stream training introduced specifically in the final long-video stage.
- Empirically, the strongest reported gains are in temporal grounding, spatial reasoning, and tracking, especially on the newly introduced JumpScore benchmark for repeated-motion localization.