RIVER: A Real-Time Interaction Benchmark for Video LLMs
- URL: http://arxiv.org/abs/2603.03985v1
- Date: Wed, 04 Mar 2026 12:29:34 GMT
- Title: RIVER: A Real-Time Interaction Benchmark for Video LLMs
- Authors: Yansong Shi, Qingsong Zhao, Tianxiang Jiang, Xiangyu Zeng, Yi Wang, Limin Wang
- Abstract summary: Real-tIme Video intERaction Bench (RIVER Bench) is designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks. We conducted detailed annotations using videos from diverse sources and varying lengths, and precisely defined the real-time interactive format.
- Score: 14.889554695311057
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of multimodal large language models has demonstrated impressive capabilities, yet nearly all operate in an offline paradigm, hindering real-time interactivity. Addressing this gap, we introduce the Real-tIme Video intERaction Bench (RIVER Bench), designed for evaluating online video comprehension. RIVER Bench introduces a novel framework comprising Retrospective Memory, Live-Perception, and Proactive Anticipation tasks, closely mimicking interactive dialogue rather than responding to an entire video at once. We conducted detailed annotations using videos from diverse sources and of varying lengths, and precisely defined the real-time interactive format. Evaluations across various model categories reveal that while offline models perform well on single-turn question-answering tasks, they struggle with real-time processing. Addressing the limitations of existing models in online video interaction, especially their deficiencies in long-term memory and future perception, we propose a general improvement method that enables models to interact with users more flexibly in real time. We believe this work will significantly advance the development of real-time interactive video understanding models and inspire future research in this emerging field. Datasets and code are publicly available at https://github.com/OpenGVLab/RIVER.
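The "real-time interactive format" is the key mechanical difference from offline evaluation: questions are issued at specific timestamps while the video is still streaming, so the model may only use frames seen so far. The sketch below illustrates one way such an evaluation loop could be wired up; every name in it (TimedQuery, StreamingModel, stream_evaluate) is a hypothetical illustration, not the benchmark's actual API, which lives in the linked repository.

```python
# A minimal sketch (not the official RIVER code) of a real-time
# interactive evaluation loop: queries arrive at timestamps mid-stream.
from dataclasses import dataclass
from typing import Iterable, List, Protocol, Tuple


@dataclass
class TimedQuery:
    ask_at: float    # second in the stream at which the user asks
    question: str
    task: str        # e.g. "retrospective_memory", "live_perception",
                     # or "proactive_anticipation" (the three task types)
    reference: str   # ground-truth answer used for scoring


class StreamingModel(Protocol):
    def observe(self, frame, timestamp: float) -> None: ...
    def respond(self, question: str) -> str: ...


def stream_evaluate(
    model: StreamingModel,
    frames: Iterable,
    fps: float,
    queries: List[TimedQuery],
) -> List[Tuple[TimedQuery, str]]:
    """Interleave queries with frame ingestion. No lookahead is possible:
    Retrospective Memory probes what was retained, Live-Perception probes
    the current moment, and Proactive Anticipation asks about content the
    model has not yet been shown."""
    pending = sorted(queries, key=lambda q: q.ask_at)
    results: List[Tuple[TimedQuery, str]] = []
    for i, frame in enumerate(frames):
        t = i / fps
        model.observe(frame, t)  # online ingestion, frame by frame
        while pending and pending[0].ask_at <= t:
            q = pending.pop(0)
            results.append((q, model.respond(q.question)))
    return results
```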
Related papers
- LiViBench: An Omnimodal Benchmark for Interactive Livestream Video Understanding [23.207637210563504]
LiViBench is an omnimodal benchmark for interactive livestream videos. It features a diverse set of 24 tasks, highlighting the perceptual, reasoning, and livestream-specific challenges. We develop LiVi-LLM-7B, an MLLM with enhanced knowledge of interactive livestreams.
arXiv Detail & Related papers (2026-01-21T14:14:20Z)
- MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning [46.07421375983852]
We propose a novel text-to-text approach to proactive interaction. We train our model MMDuet2 on a dataset of 52k videos with two types of dialogues via SFT and RL. Experimental results demonstrate that MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality.
arXiv Detail & Related papers (2025-12-07T12:03:04Z)
- IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video? [56.33950760097989]
IWR-Bench is a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions. This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code.
arXiv Detail & Related papers (2025-09-29T12:38:06Z)
- Matrix-Game 2.0: An Open-Source, Real-Time, and Streaming Interactive World Model [15.16063778402193]
Matrix-Game 2.0 is an interactive world model that generates long videos on the fly via few-step auto-regressive diffusion. It can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS.
arXiv Detail & Related papers (2025-08-18T15:28:53Z)
- Yan: Foundational Interactive Video Generation [25.398980906541524]
Yan is a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing. We design a highly compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process. We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text.
arXiv Detail & Related papers (2025-08-12T03:34:21Z)
- ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts [56.75723197779384]
ARC-Hunyuan-Video is a multimodal model that processes visual, audio, and textual signals end-to-end for structured comprehension. Our model is capable of multi-granularity timestamped video captioning and summarization, open-ended video question answering, temporal video grounding, and video reasoning.
arXiv Detail & Related papers (2025-07-28T15:52:36Z)
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain. We introduce a new approach that models video-text pairs as game players using multivariate cooperative game theory. We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z)
- InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.90258030688256]
This project introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. It simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
arXiv Detail & Related papers (2024-12-12T18:58:30Z)