TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
- URL: http://arxiv.org/abs/2505.15435v1
- Date: Wed, 21 May 2025 12:18:02 GMT
- Title: TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models
- Authors: Zeqing Wang, Shiyuan Zhang, Chengpei Tang, Keze Wang
- Abstract summary: Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge, is a fundamental aspect of human visual understanding. We introduce TimeCausality, a novel benchmark designed to evaluate the causal reasoning ability of Vision-Language Models (VLMs) in the temporal dimension. We find that while current SOTA open-source VLMs have achieved performance levels comparable to closed-source models like GPT-4o, they fall significantly behind their closed-source competitors on our benchmark.
- Score: 13.018267909897014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge (e.g., fruit decay and human aging), is a fundamental aspect of human visual understanding. Unlike temporal perception based on simple event sequences, this form of reasoning requires a deeper comprehension of how object states change over time. Although current powerful Vision-Language Models (VLMs) have demonstrated impressive performance on a wide range of downstream tasks, their capacity to reason about temporal causality remains underexplored. To address this gap, we introduce TimeCausality, a novel benchmark specifically designed to evaluate the causal reasoning ability of VLMs in the temporal dimension. Based on TimeCausality, we find that while current SOTA open-source VLMs have achieved performance levels comparable to closed-source models like GPT-4o on various standard visual question answering tasks, they fall significantly behind their closed-source competitors on our benchmark. Furthermore, even GPT-4o exhibits a marked drop in performance on TimeCausality compared to its results on other tasks. These findings underscore the critical need to incorporate temporal causality into the evaluation and development of VLMs, and they highlight an important challenge for the open-source VLM community moving forward. Code and data are available at https://github.com/Zeqing-Wang/TimeCausality.
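To make the evaluation setup concrete, below is a minimal sketch of how a VLM might be scored on benchmark items of this kind. The JSON file name, its fields, and the query_vlm stub are illustrative assumptions, not the actual interface of the TimeCausality repository.

```python
import json


def query_vlm(image_path: str, question: str) -> str:
    # Placeholder for a call to any vision-language model
    # (API client or local checkpoint); hypothetical, not the paper's code.
    raise NotImplementedError


def evaluate(benchmark_file: str) -> float:
    # Assumed item format: {"image": ..., "question": ..., "answer": ...},
    # where each question probes an irreversible temporal change
    # (e.g., which of two images shows the fruit at a later time).
    with open(benchmark_file) as f:
        items = json.load(f)

    # Exact-match scoring against the annotated answer.
    correct = sum(
        query_vlm(item["image"], item["question"]).strip().lower()
        == item["answer"].strip().lower()
        for item in items
    )
    return correct / len(items)


if __name__ == "__main__":
    print(f"Accuracy: {evaluate('timecausality_items.json'):.2%}")
```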
Related papers
- VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal awareness of vision language models (VLMs). Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs. We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
arXiv Detail & Related papers (2025-08-04T06:06:06Z) - What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning [26.671128120554457]
Causal reasoning is fundamental to solving complex high-level reasoning tasks. Existing benchmarks often mix different types of reasoning questions, making causal reasoning hard to isolate. We introduce VQA-Causal and VCR-Causal to isolate and rigorously evaluate causal reasoning abilities.
arXiv Detail & Related papers (2025-06-01T07:17:46Z) - RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models [85.59909303288921]
We introduce RTime-QA, a novel benchmark designed to assess the atomic temporal event understanding ability of Large Multi-modal Models (LMMs). RTime-QA comprises 822 high-quality, carefully curated video-text questions, each meticulously annotated by human experts. To advance LMMs' temporal event understanding ability, we further introduce RTime-IT, a 14k instruction-tuning dataset that employs an annotation process similar to RTime-QA's.
arXiv Detail & Related papers (2025-05-25T12:44:12Z) - Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [90.65399476233495]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning types: Temporal, Causal, Spatial, and Logical Reasoning. We propose an evaluation framework that assesses Instruction Reasoning, Appearance Consistency, and Visual Plausibility with both human judges and an LMM-as-a-judge approach.
arXiv Detail & Related papers (2025-04-03T17:59:56Z) - Video SimpleQA: Towards Factuality Evaluation in Large Video Language Models [69.68265487134686]
Video SimpleQA is the first comprehensive benchmark tailored for factuality evaluation of LVLMs. Our work is distinguished from existing video benchmarks by the following key features. Answers are crafted as unambiguous and definitively correct in a short format.
arXiv Detail & Related papers (2025-03-24T17:46:09Z) - ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events [0.20132569095596248]
We present ChronoSense, a new benchmark for evaluating Large Language Models' temporal understanding. We assess the performance of seven recent LLMs using this benchmark, and the results indicate that models handle Allen interval relations, even symmetrical ones, quite differently (a small illustration of these relations follows this list). Overall, the models' low performance highlights the need for improved temporal understanding in LLMs.
arXiv Detail & Related papers (2025-01-06T14:27:41Z) - GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks [84.86699025256705]
We present GEOBench-VLM, a benchmark specifically designed to evaluate Vision-Language Models (VLMs) on geospatial tasks. Our benchmark features over 10,000 manually verified instructions spanning diverse visual conditions, object types, and scales. We evaluate several state-of-the-art VLMs to assess performance on geospatial-specific challenges.
arXiv Detail & Related papers (2024-11-28T18:59:56Z) - Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? [70.19200858203388]
Temporal reasoning is fundamental for large language models to comprehend the world.
CoTempQA is a benchmark containing four co-temporal scenarios.
Our experiments reveal a significant gap between the performance of current LLMs and human-level reasoning.
arXiv Detail & Related papers (2024-06-13T12:56:21Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - Temporal Blind Spots in Large Language Models [20.631107338678234]
Large language models (LLMs) have recently gained significant attention due to their unparalleled ability to perform various natural language processing tasks.
This study investigates the underlying limitations of general-purpose LLMs when deployed for tasks that require a temporal understanding.
arXiv Detail & Related papers (2024-01-22T16:20:14Z) - VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models [27.280311932711847]
We present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding.
We first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects.
We generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect.
arXiv Detail & Related papers (2023-11-29T07:15:34Z) - TRAM: Benchmarking Temporal Reasoning for Large Language Models [12.112914393948415]
We introduce TRAM, a temporal reasoning benchmark composed of ten datasets.
We evaluate popular language models like GPT-4 and Llama2 in zero-shot and few-shot scenarios.
Our findings indicate that the best-performing model lags significantly behind human performance.
arXiv Detail & Related papers (2023-10-02T00:59:07Z)
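As noted in the ChronoSense entry above: Allen's interval algebra defines thirteen qualitative relations between two time intervals (before, meets, overlaps, starts, during, finishes, their six inverses, and equals, which is its own inverse). The sketch below illustrates the algebra itself; it is not code from any of the papers listed here.

```python
def allen_relation(a_start: int, a_end: int, b_start: int, b_end: int) -> str:
    """Classify the Allen relation of interval A relative to interval B.

    Assumes well-formed intervals with start < end.
    """
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"          # inverse of "before"
    if b_end == a_start:
        return "met-by"         # inverse of "meets"
    if (a_start, a_end) == (b_start, b_end):
        return "equals"         # the only self-inverse relation
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Remaining cases are strict partial overlaps.
    return "overlaps" if a_start < b_start else "overlapped-by"


# Examples: A meets B when A ends exactly where B begins;
# A is during B when B strictly contains A.
assert allen_relation(1, 3, 3, 5) == "meets"
assert allen_relation(2, 4, 1, 5) == "during"
```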
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.