Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
- URL: http://arxiv.org/abs/2506.22385v1
- Date: Fri, 27 Jun 2025 16:51:15 GMT
- Title: Can Video Large Multimodal Models Think Like Doubters-or Double-Down: A Study on Defeasible Video Entailment
- Authors: Yue Zhang, Jilei Sun, Yunhui Guo, Vibhav Gogate
- Abstract summary: We introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates.
- Score: 19.682019558287973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Large Multimodal Models (VLMMs) have made impressive strides in understanding video content, but they often struggle with abstract and adaptive reasoning: the ability to revise their interpretations when new information emerges. In reality, conclusions are rarely set in stone; additional context can strengthen or weaken an initial inference. To address this, we introduce Defeasible Video Entailment (DVidE), a new task that challenges models to think like doubters, constantly updating their reasoning based on evolving evidence. In DVidE, given a video premise and a textual hypothesis, models must determine whether a new update strengthens or weakens the hypothesis (classification version) or generate a coherent update that modifies the entailment relationship (generation version). To solve the classification task, we propose the Chain of Counterfactual Thought framework, which uses counterfactual reasoning, ASR-enhanced video content, and rationale refinement to reduce inference bias. For the generation task, we develop a framework that combines ASR output with a Large Language Model (LLM) to produce coherent, contextually relevant updates aligned with the intended strengthener or weakener goal. Additionally, we introduce a novel benchmark dataset with strengthener/weakener annotations and an LLM-based evaluation metric specifically designed for assessing generative performance. Experimental results demonstrate significant improvements, highlighting the effectiveness of our proposed methods in enhancing the dynamic reasoning capabilities of VLMMs.
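As a concrete illustration of the task format described in the abstract, the sketch below shows what a DVidE-style classification example and a prompting loop might look like. This is not the authors' released code: the `DVidEExample` fields, the `classify_update` helper, and the `vlmm.generate` call are hypothetical placeholders for whatever dataset schema and model API one actually uses.

```python
from dataclasses import dataclass


# Hypothetical schema for a single DVidE classification-task example.
# Field names are illustrative, not taken from the authors' dataset release.
@dataclass
class DVidEExample:
    video_path: str   # the video premise
    hypothesis: str   # textual hypothesis about the video
    update: str       # new textual information to judge
    label: str        # gold annotation: "strengthener" or "weakener"


def classify_update(vlmm, example: DVidEExample) -> str:
    """Ask a video model whether the update strengthens or weakens the hypothesis.

    `vlmm.generate` stands in for whichever inference API the chosen VLMM
    exposes; the prompt simply restates the task definition from the abstract.
    """
    prompt = (
        f"Hypothesis: {example.hypothesis}\n"
        f"Update: {example.update}\n"
        "Does the update strengthen or weaken the hypothesis? "
        "Answer with one word: strengthener or weakener."
    )
    answer = vlmm.generate(video=example.video_path, prompt=prompt)
    return "strengthener" if "strength" in answer.lower() else "weakener"
```

The generation variant of the task would instead prompt the model, optionally with ASR transcripts as in the authors' framework, to produce an update targeted at a given strengthener or weakener label.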
Related papers
- Team of One: Cracking Complex Video QA with Model Synergy [24.75732964829523]
We propose a novel framework for open-ended video question answering that enhances reasoning depth and robustness in complex real-world scenarios. Existing Video-Large Multimodal Models (Video-LMMs) often exhibit limited contextual understanding, weak temporal modeling, and poor generalization to ambiguous or compositional queries.
arXiv Detail & Related papers (2025-07-18T11:12:44Z) - Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
arXiv Detail & Related papers (2025-06-02T17:28:26Z) - Reasoning is All You Need for Video Generalization: A Counterfactual Benchmark with Sub-question Evaluation [19.46864730994867]
We introduce COVER (COunterfactual VidEo Reasoning), a multidimensional multimodal benchmark. It decomposes complex queries into structured sub-questions, enabling fine-grained reasoning analysis.
arXiv Detail & Related papers (2025-03-12T03:25:51Z) - Your Language Model May Think Too Rigidly: Achieving Reasoning Consistency with Symmetry-Enhanced Training [66.48331530995786]
We propose syMmetry-ENhanceD (MEND) Data Augmentation, a data-centric approach that improves the model's ability to extract useful information from context. Unlike existing methods that emphasize reasoning chain augmentation, our approach improves model robustness at the knowledge extraction stage. Experiments on both logical and arithmetic reasoning tasks show that MEND enhances reasoning performance across diverse query variations.
arXiv Detail & Related papers (2025-02-25T03:03:35Z) - Admitting Ignorance Helps the Video Question Answering Models to Answer [82.22149677979189]
We argue that models often establish shortcuts, resulting in spurious correlations between questions and answers. We propose a novel training framework in which the model is compelled to acknowledge its ignorance when presented with an intervened question. In practice, we integrate a state-of-the-art model into our framework to validate its effectiveness.
arXiv Detail & Related papers (2025-01-15T12:44:52Z) - Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization [19.32714581384729]
We introduce a new task called Defeasible Visual Entailment (DVE). The goal is to allow the modification of the entailment relationship between an image premise and a text hypothesis based on an additional update. At a high level, DVE enables models to refine their initial interpretations, leading to improved accuracy and reliability in various applications.
arXiv Detail & Related papers (2024-12-19T02:38:31Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step, explicit spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - QD-VMR: Query Debiasing with Contextual Understanding Enhancement for Video Moment Retrieval [7.313447367245476]
Video Moment Retrieval (VMR) aims to retrieve relevant moments of an untrimmed video corresponding to the query.
We propose a novel model called QD-VMR, a query debiasing model with enhanced contextual understanding.
arXiv Detail & Related papers (2024-08-23T10:56:42Z) - Belief Revision: The Adaptability of Large Language Models Reasoning [63.0281286287648]
We introduce Belief-R, a new dataset designed to test LMs' belief revision ability when presented with new evidence.
Inspired by how humans suppress prior inferences, this task assesses LMs within the newly proposed delta reasoning framework.
We evaluate around 30 LMs across diverse prompting strategies and find that LMs generally struggle to appropriately revise their beliefs in response to new information.
arXiv Detail & Related papers (2024-06-28T09:09:36Z) - Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.