CyberV: Cybernetics for Test-time Scaling in Video Understanding
- URL: http://arxiv.org/abs/2506.07971v1
- Date: Mon, 09 Jun 2025 17:45:18 GMT
- Title: CyberV: Cybernetics for Test-time Scaling in Video Understanding
- Authors: Jiahao Meng, Shuyang Sun, Yue Tan, Lu Qi, Yunhai Tong, Xiangtai Li, Longyin Wen
- Abstract summary: Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos. We propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller.
- Score: 46.693118184157626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current Multimodal Large Language Models (MLLMs) may struggle with understanding long or complex videos due to computational demands at test time, lack of robustness, and limited accuracy, primarily stemming from their feed-forward processing nature. These limitations could be more severe for models with fewer parameters. To address these issues, we propose a novel framework inspired by cybernetic principles, redesigning video MLLMs as adaptive systems capable of self-monitoring, self-correction, and dynamic resource allocation during inference. Our approach, CyberV, introduces a cybernetic loop consisting of an MLLM Inference System, a Sensor, and a Controller. Specifically, the sensor monitors the forward processes of the MLLM and collects intermediate interpretations, such as attention drift; the controller then determines when and how to trigger self-correction and generate feedback to guide the next round. This test-time adaptive scaling framework enhances frozen MLLMs without requiring retraining or additional components. Experiments demonstrate significant improvements: CyberV boosts Qwen2.5-VL-7B by 8.3% and InternVL3-8B by 5.5% on VideoMMMU, surpassing the competitive proprietary model GPT-4o. When applied to Qwen2.5-VL-72B, it yields a 10.0% improvement, achieving performance comparable even to human experts. Furthermore, our method demonstrates consistent gains on general-purpose benchmarks, such as VideoMME and WorldSense, highlighting its effectiveness and generalization capabilities in making MLLMs more robust and accurate for dynamic video understanding. The code is released at https://github.com/marinero4972/CyberV.
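The closed loop described in the abstract (a Sensor that monitors the frozen MLLM's forward pass and a Controller that decides when to trigger another, feedback-guided round) can be illustrated with a minimal sketch. This is not the released CyberV implementation; every name below (SensorReading, cybernetic_loop, run_inference, the drift and confidence thresholds, and the feedback prompt) is an illustrative assumption.

```python
# Minimal sketch of a cybernetic test-time loop, assuming a hypothetical
# `run_inference` wrapper around a frozen video MLLM. Names and thresholds
# are illustrative, not the CyberV API.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SensorReading:
    """Intermediate signals the Sensor collects from one forward pass."""
    answer: str
    confidence: float        # e.g., mean token log-probability of the answer
    attention_drift: float   # e.g., how far visual attention drifts from the question focus


def cybernetic_loop(
    run_inference: Callable[[dict, Optional[str]], SensorReading],
    query: dict,                      # video frames + question
    max_rounds: int = 3,
    drift_threshold: float = 0.5,
    confidence_threshold: float = 0.7,
) -> str:
    """Monitor the MLLM's forward process and trigger self-correction when needed."""
    feedback: Optional[str] = None
    best: Optional[SensorReading] = None

    for _ in range(max_rounds):
        # MLLM Inference System + Sensor: run one forward pass and collect signals.
        reading = run_inference(query, feedback)
        if best is None or reading.confidence > best.confidence:
            best = reading

        # Controller: accept the answer if the signals look healthy.
        if reading.confidence >= confidence_threshold and reading.attention_drift <= drift_threshold:
            return reading.answer

        # Otherwise generate feedback to guide the next round, e.g., steer the
        # model back toward the video content it drifted away from.
        feedback = (
            "Your previous answer may have ignored relevant parts of the video. "
            "Re-examine the frames most related to the question before answering."
        )

    return best.answer  # fall back to the highest-confidence round
```

In this reading, no weights are updated: extra test-time compute is spent only when the Sensor's signals suggest the first pass was unreliable, which matches the abstract's claim of enhancing frozen MLLMs without retraining or additional components.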
Related papers
- Let's Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification [17.67273082468732]
Verifiers -- functions assigning rewards to agent behavior -- have been key for AI progress in domains like math and board games. We evaluate Multimodal Large Language Models (MLLMs) as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation. We propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs' knowledge and reasoning.
arXiv Detail & Related papers (2025-07-15T18:50:29Z) - Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning. We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies. We also propose an improved reinforcement learning algorithm, AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
arXiv Detail & Related papers (2025-05-26T12:05:16Z) - RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video [19.373906873461703]
RTV-Bench is a fine-grained benchmark for MLLM real-time video analysis. RTV-Bench contains 552 diverse videos (167.2 hours) and 4,631 high-quality QA pairs.
arXiv Detail & Related papers (2025-05-04T10:55:21Z) - InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. We develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate that this LRC design greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z) - MVTamperBench: Evaluating Robustness of Vision-Language Models [5.062181035021214]
We introduce MVTamperBench, a benchmark that systematically evaluates MLLM robustness against five prevalent tampering techniques. MVTamperBench comprises 3.4K original videos, expanded into over 17K tampered clips covering 19 distinct video manipulation tasks.
arXiv Detail & Related papers (2024-12-27T18:47:05Z) - Unveiling Uncertainty: A Deep Dive into Calibration and Performance of Multimodal Large Language Models [36.81503322875839]
Multimodal large language models (MLLMs) combine visual and textual data for tasks such as image captioning and visual question answering. This paper investigates representative MLLMs, focusing on their calibration across various scenarios. We observed miscalibration in their performance and, at the same time, no significant differences in calibration across these scenarios.
arXiv Detail & Related papers (2024-12-19T09:10:07Z) - COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework [12.255542503707172]
COEF-VQ is a novel cascaded MLLM framework designed to enhance video quality understanding on the short-video platform. Our framework significantly reduces GPU usage while maintaining the strong classification performance of a full MLLM deployment.
arXiv Detail & Related papers (2024-12-11T08:10:32Z) - DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the computational cost of the LLM by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
However, how to effectively encode and understand videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)