Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency
- URL: http://arxiv.org/abs/2505.14405v1
- Date: Tue, 20 May 2025 14:18:56 GMT
- Title: Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency
- Authors: Jiafeng Liang, Shixin Jiang, Xuan Dong, Ning Wang, Zheng Chu, Hui Su, Jinlan Fu, Ming Liu, See-Kiong Ng, Bing Qin
- Abstract summary: We propose a novel temporal robustness benchmark (TemRobBench) to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments. We design panoramic direct preference optimization (PanoDPO) to encourage LMMs to incorporate both visual and linguistic feature preferences simultaneously.
- Score: 59.05753942719665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Multimodal Models (LMMs) have recently demonstrated impressive performance on general video comprehension benchmarks. Nevertheless, for broader applications, the robustness of their temporal analysis capability needs to be thoroughly investigated, yet it has been predominantly ignored. Motivated by this, we propose a novel temporal robustness benchmark (TemRobBench), which introduces temporal inconsistency perturbations separately in the visual and textual modalities to assess the robustness of models. We evaluate 16 mainstream LMMs and find that they exhibit over-reliance on prior knowledge and textual context in adversarial environments, while ignoring the actual temporal dynamics in the video. To mitigate this issue, we design panoramic direct preference optimization (PanoDPO), which encourages LMMs to incorporate both visual and linguistic feature preferences simultaneously. Experimental results show that PanoDPO can effectively enhance the model's robustness and reliability in temporal analysis.
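The abstract names PanoDPO but not its form. As a minimal sketch, assuming the objective combines a standard DPO term over visually-grounded preference pairs with a second term over linguistically-grounded pairs, the snippet below shows one way such a combined loss could look; the function names, tuple layout, and `lam` weighting are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Standard DPO loss from per-sequence log-probabilities: prefer the
    # chosen response by the policy/reference log-ratio margin.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()

def panodpo_sketch(vis_pair, txt_pair, beta=0.1, lam=0.5):
    # vis_pair / txt_pair: (policy_chosen, policy_rejected, ref_chosen,
    # ref_rejected) log-probs built from visually- and textually-perturbed
    # preference data, respectively (hypothetical interface).
    loss_vis = dpo_loss(*vis_pair, beta=beta)   # visual feature preference
    loss_txt = dpo_loss(*txt_pair, beta=beta)   # linguistic feature preference
    return lam * loss_vis + (1.0 - lam) * loss_txt

# toy usage with random log-probabilities for a batch of 4 sequences
pair = tuple(torch.randn(4) for _ in range(4))
print(panodpo_sketch(pair, pair).item())
```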
Related papers
- Seeing the Arrow of Time in Large Multimodal Models [55.13176722268499]
Current large multimodal models (LMMs) struggle to perceive and utilize temporal directionality in video when responding to language queries. We introduce ArrowRL, a reinforcement learning (RL)-based training strategy with an innovative reverse reward that instills arrow-of-time (AoT) awareness. For rigorous evaluation, we additionally develop AoTBench, a new multi-faceted benchmark probing temporally challenging questions.
arXiv Detail & Related papers (2025-06-03T19:32:07Z)
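The abstract does not spell out the reverse reward, so the following is only one plausible reading: reward answers that are correct on the forward clip and that change appropriately when the clip is time-reversed. The function name and scoring scheme are assumptions for illustration.

```python
def reverse_reward_sketch(ans_fwd, ans_rev, gold_fwd, gold_rev):
    """Hypothetical 'reverse reward' for RL training: credit correct answers
    on both the forward and the time-reversed clip, and penalize giving the
    same answer to both when the ground truths differ (i.e., ignoring the
    arrow of time). Not the paper's exact formulation."""
    reward = float(ans_fwd == gold_fwd) + float(ans_rev == gold_rev)
    if ans_fwd == ans_rev and gold_fwd != gold_rev:
        reward -= 1.0  # time-insensitive behavior is discouraged
    return reward

# toy usage: the model answered "opening" for both playback directions
print(reverse_reward_sketch("opening", "opening", "opening", "closing"))  # 0.0
```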
- Can Slow-thinking LLMs Reason Over Time? Empirical Studies in Time Series Forecasting [17.73769436497384]
Time series forecasting (TSF) is a fundamental and widely studied task, spanning methods from classical statistical approaches to modern deep learning and multimodal language modeling. Meanwhile, emerging slow-thinking LLMs have demonstrated impressive multi-step reasoning capabilities across diverse domains. This motivates a key question: can slow-thinking LLMs effectively reason over temporal patterns to support time series forecasting, even in a zero-shot manner?
arXiv Detail & Related papers (2025-05-30T12:19:02Z)
- Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models [21.579319926212296]
Large Language Models (LLMs) have emerged as powerful tools for generating coherent text, understanding context, and performing reasoning tasks. However, they struggle with temporal reasoning, which requires processing time-related information such as event sequencing, durations, and inter-temporal relationships. We introduce TISER, a novel framework that enhances the temporal reasoning abilities of LLMs through a multi-stage process that combines timeline construction with iterative self-reflection.
arXiv Detail & Related papers (2025-04-07T16:51:45Z)
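As a rough sketch of the multi-stage idea (build a timeline, answer from it, then iteratively self-reflect), the loop below assumes a generic `llm(prompt) -> str` callable; the prompts, stopping rule, and round count are illustrative assumptions, not TISER's actual pipeline.

```python
def tiser_style_answer(llm, question, max_rounds=3):
    # Stage 1: construct an explicit timeline of the events in the question.
    timeline = llm(f"List the events in this question as a dated timeline:\n{question}")
    # Stage 2: answer conditioned on the constructed timeline.
    answer = llm(f"Timeline:\n{timeline}\nUsing only this timeline, answer:\n{question}")
    # Stage 3: iterative self-reflection against the timeline.
    for _ in range(max_rounds):
        critique = llm(
            f"Timeline:\n{timeline}\nAnswer:\n{answer}\n"
            "Does the answer contradict the timeline? Reply CONSISTENT or explain the error."
        )
        if critique.strip().startswith("CONSISTENT"):
            break
        answer = llm(f"Revise the answer to fix this error:\n{critique}\nQuestion:\n{question}")
    return answer
```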
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [90.65399476233495]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
- MAUCell: An Adaptive Multi-Attention Framework for Video Frame Prediction [0.0]
We introduce the Multi-Attention Unit (MAUCell), which combines Generative Adversarial Networks (GANs) and attention mechanisms to improve video prediction. The new design maintains an equilibrium between temporal continuity and spatial accuracy to deliver reliable video prediction.
arXiv Detail & Related papers (2025-01-28T14:52:10Z)
- Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding [23.477954901326978]
Existing approaches adopt either implicit temporal modeling, relying solely on the decoder, or explicit temporal modeling, employing auxiliary temporal encoders. We propose the Stackable Temporal Encoder (STE) to enable flexible explicit temporal modeling with adjustable temporal receptive fields and token compression ratios. Our findings emphasize the critical role of explicit temporal modeling, providing actionable insights to advance video MLLMs.
arXiv Detail & Related papers (2025-01-28T08:30:58Z)
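To make "adjustable temporal receptive fields and token compression ratios" concrete, here is a minimal sketch using a strided temporal convolution over frame tokens; the module name, the choice of Conv1d, and the parameterization are assumptions, not the paper's STE architecture.

```python
import torch
import torch.nn as nn

class TemporalCompressorSketch(nn.Module):
    """Compress a sequence of frame tokens along time. `kernel` sets the
    temporal receptive field; `ratio` sets the token compression ratio.
    Illustrative only; not the paper's actual STE design."""
    def __init__(self, dim, kernel=3, ratio=2):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=kernel,
                              stride=ratio, padding=kernel // 2)

    def forward(self, x):           # x: (batch, frames, dim)
        x = x.transpose(1, 2)       # (batch, dim, frames) for Conv1d
        x = self.conv(x)            # roughly frames/ratio tokens remain
        return x.transpose(1, 2)

tokens = torch.randn(2, 16, 64)     # 16 frame tokens of width 64
print(TemporalCompressorSketch(64)(tokens).shape)  # torch.Size([2, 8, 64])
```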
- Progressive Multimodal Reasoning via Active Retrieval [64.74746997923967]
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs). We propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs. We show that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
arXiv Detail & Related papers (2024-12-19T13:25:39Z)
- Sycophancy in Vision-Language Models: A Systematic Analysis and an Inference-Time Mitigation Framework [18.54098084470481]
We analyze sycophancy across vision-language benchmarks and propose an inference-time mitigation framework. Our framework effectively mitigates sycophancy across all evaluated models, while maintaining performance on neutral prompts.
arXiv Detail & Related papers (2024-08-21T01:03:21Z)
- A Novel Energy based Model Mechanism for Multi-modal Aspect-Based Sentiment Analysis [85.77557381023617]
We propose a novel framework called DQPSA for multi-modal aspect-based sentiment analysis.
The PDQ module uses the prompt as both a visual query and a language query to extract prompt-aware visual information.
The EPE module models the boundary pairing of the analysis target from the perspective of an Energy-based Model.
arXiv Detail & Related papers (2023-12-13T12:00:46Z)
- On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [81.2935966933355]
We study the impact of visual adversarial attacks on Large Multimodal Models (LMMs).
We find that, in general, LMMs are not robust to visual adversarial inputs.
We propose a new approach to real-world image classification which we term query decomposition.
arXiv Detail & Related papers (2023-12-06T04:59:56Z)
- Rethinking Uncertainly Missing and Ambiguous Visual Modality in Multi-Modal Entity Alignment [38.574204922793626]
We present a further analysis of visual modality incompleteness, benchmarking the latest MMEA models on our proposed dataset, MMEA-UMVM.
Our research indicates that, in the face of modality incompleteness, models succumb to overfitting the modality noise and exhibit performance oscillations or declines at high rates of missing modality.
We introduce UMAEA, a robust multi-modal entity alignment approach designed to tackle uncertainly missing and ambiguous visual modalities.
arXiv Detail & Related papers (2023-07-30T12:16:49Z)
- Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models.
We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
arXiv Detail & Related papers (2022-04-25T19:06:48Z)
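The abstract does not give the quantification method, but one simple, commonly used probe of temporal reliance is to compare a model's predictions on ordered versus shuffled frames. The sketch below implements that generic probe, with the model interface assumed; it is not the paper's actual measure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def temporal_sensitivity_sketch(model, frames):
    """Generic probe (not the paper's metric): KL divergence between
    predictions on ordered frames and on a random temporal shuffle.
    `frames` is assumed to be (batch, time, channels, height, width) and
    `model` a callable returning class logits."""
    p_ordered = model(frames).softmax(dim=-1)
    shuffled = frames[:, torch.randperm(frames.shape[1])]
    p_shuffled = model(shuffled).softmax(dim=-1)
    # High divergence suggests the model exploits temporal structure.
    return F.kl_div(p_shuffled.log(), p_ordered, reduction="batchmean")

# toy usage: a "model" that averages over time, hence is order-blind
proj = torch.nn.Linear(3 * 8 * 8, 5)
toy_model = lambda v: proj(v.mean(dim=1).flatten(1))
clip = torch.randn(2, 16, 3, 8, 8)
print(temporal_sensitivity_sketch(toy_model, clip).item())  # 0.0: time-blind
```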
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.