Seeing the Arrow of Time in Large Multimodal Models
- URL: http://arxiv.org/abs/2506.03340v1
- Date: Tue, 03 Jun 2025 19:32:07 GMT
- Title: Seeing the Arrow of Time in Large Multimodal Models
- Authors: Zihui Xue, Mi Luo, Kristen Grauman
- Abstract summary: Current large multimodal models (LMMs) struggle to perceive and utilize temporal directionality in video when responding to language queries. We introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions.
- Score: 55.13176722268499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Arrow of Time (AoT), time's irreversible flow shaping physical events, is fundamental to video comprehension, yet it remains a significant challenge for modern large multimodal models (LMMs). Current LMMs struggle to perceive and utilize temporal directionality in video when responding to language queries, obstructing deeper temporal understanding. We tackle this deficiency by first providing a critical analysis of existing benchmarks and models. We then introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills AoT awareness by encouraging divergent video interpretations between forward and reversed visual frames. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions. Experiments show ArrowRL greatly advances temporal perception: it not only achieves substantial improvements on our challenging AoTBench but also demonstrably boosts performance on standard video question answering (VQA) benchmarks (with peak accuracy gains of over 20% and 10%, respectively). This validates ArrowRL's effectiveness and highlights the critical need for dedicated AoT understanding in LMMs.
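The reverse-reward idea can be sketched in a few lines. The following is a minimal illustration under stated assumptions: the `model` interface (mapping frames and a query to an answer distribution), the `(T, C, H, W)` frame layout, and the symmetric-KL divergence are all our choices for illustration, not details taken from the paper.

```python
import torch

def symmetric_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two probability vectors."""
    p, q = p + eps, q + eps  # guard against log(0)
    return 0.5 * ((p * (p / q).log()).sum() + (q * (q / p).log()).sum())

def reverse_reward(model, frames, query):
    """Sketch of an ArrowRL-style reverse reward.

    Intuition from the abstract: a temporally aware model should read a
    clip differently when its frames are reversed, so divergence between
    the forward and reversed readings is rewarded. `model` is assumed to
    map (frames, query) to an answer distribution; `frames` is assumed
    to be a (T, C, H, W) tensor with time as the first axis. Both are
    illustrative assumptions, not the paper's actual interface.
    """
    p_fwd = model(frames, query)                        # forward reading
    p_rev = model(torch.flip(frames, dims=[0]), query)  # reversed reading
    return symmetric_kl(p_fwd, p_rev)  # higher divergence -> higher reward
```

In a full RL recipe, a scalar like this would presumably be combined with a task reward (e.g., answer correctness) so the policy cannot maximize it by producing arbitrary outputs in one direction.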
Related papers
- MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding [55.32878803528196]
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. We propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning.
arXiv Detail & Related papers (2025-05-27T04:50:07Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL uses the answer accuracy of a downstream model as a reward signal to train a frame selector through trial and error (a minimal sketch of this reward pattern appears after this list). ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
arXiv Detail & Related papers (2025-05-21T12:29:40Z)
- Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency [59.05753942719665]
We propose a novel temporal robustness benchmark (TemRobBench) to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments. We design panoramic direct preference optimization (PanoDPO) to encourage LMMs to incorporate both visual and linguistic feature preferences simultaneously.
arXiv Detail & Related papers (2025-05-20T14:18:56Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding [57.26400319795876]
Temporal Video Grounding (TVG) is a core challenge in long-form video understanding. Recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning. We propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning.
arXiv Detail & Related papers (2025-03-17T17:04:20Z)
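To make the reward pattern shared by several of these works concrete, here is a minimal sketch of a ViaRL-style frame-selection reward, referenced from the ViaRL entry above. The `selector` and `answerer` interfaces, the `sample` method, and the exact-match scoring are hypothetical stand-ins for illustration, not the paper's implementation.

```python
def frame_selection_reward(selector, answerer, frames, question, gold_answer, k=8):
    """Sketch of a rule-based, trial-and-error reward for frame selection.

    `selector.sample` is a hypothetical policy that returns k frame
    indices; `answerer` is a hypothetical frozen downstream video QA
    model. The selector is scored only by downstream answer correctness.
    """
    idx = selector.sample(frames, question, k)  # policy action
    picked = [frames[i] for i in sorted(idx)]   # preserve temporal order
    prediction = answerer(picked, question)
    # Sparse 0/1 reward: did the downstream model answer correctly?
    return 1.0 if prediction.strip().lower() == gold_answer.strip().lower() else 0.0
```

Against this sparse signal, the selector would be updated with a policy-gradient method; the trial-and-error loop is what the ViaRL summary describes, while every name above is a placeholder.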