RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models
- URL: http://arxiv.org/abs/2505.19125v1
- Date: Sun, 25 May 2025 12:44:12 GMT
- Title: RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models
- Authors: Yuqi Liu, Qin Jin, Tianyuan Qu, Xuan Liu, Yang Du, Bei Yu, Jiaya Jia
- Abstract summary: We introduce RTime-QA, a novel benchmark designed to assess the atomic temporal event understanding ability of Large Multi-modal Models (LMMs). RTime-QA comprises 822 high-quality, carefully-curated video-text questions, each meticulously annotated by human experts. To advance LMMs' temporal event understanding ability, we further introduce RTime-IT, a 14k instruction-tuning dataset that employs a similar annotation process as RTime-QA.
- Score: 85.59909303288921
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Accurately understanding atomic temporal events is essential for video comprehension. However, current video-language benchmarks often fall short in evaluating Large Multi-modal Models' (LMMs) temporal event understanding capabilities, as many of their questions can be answered effectively by image-language models. In this paper, we introduce RTime-QA, a novel benchmark specifically designed to assess the atomic temporal event understanding ability of LMMs. RTime-QA comprises 822 high-quality, carefully curated video-text questions, each meticulously annotated by human experts. Each question features a video depicting an atomic temporal event, paired with both correct answers and temporal negative descriptions, specifically designed to evaluate temporal understanding. To advance LMMs' temporal event understanding ability, we further introduce RTime-IT, a 14k instruction-tuning dataset that employs a similar annotation process to RTime-QA. Extensive experimental analysis demonstrates that RTime-QA presents a significant challenge for LMMs: the state-of-the-art model Qwen2-VL achieves only 34.6 on the strict-ACC metric, substantially lagging behind human performance. Furthermore, our experiments reveal that RTime-IT effectively enhances LMMs' capacity for temporal understanding. After fine-tuning on RTime-IT, Qwen2-VL achieves 65.9 on RTime-QA.
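The abstract reports a strict-ACC score without defining the metric. Below is a minimal sketch of one plausible reading (an assumption, not the paper's released evaluation code): each video's group of questions, e.g., the original event description and its temporal-negative counterpart, counts as correct only if every question in the group is answered correctly. The field names and the `predict` callable are hypothetical.

```python
from collections import defaultdict

def strict_accuracy(items, predict):
    """Strict accuracy over question groups: a video scores only if
    ALL of its associated questions are answered correctly.
    (Assumed reading of 'strict-ACC'; RTime-QA may define it differently.)"""
    groups = defaultdict(list)
    for item in items:
        # 'video_id', 'question', 'options', 'answer' are hypothetical field names.
        pred = predict(item["video_id"], item["question"], item["options"])
        groups[item["video_id"]].append(pred == item["answer"])
    strict_hits = sum(all(flags) for flags in groups.values())
    return 100.0 * strict_hits / max(len(groups), 1)

# Usage with a dummy predictor that always picks the first option:
if __name__ == "__main__":
    data = [
        {"video_id": "v1", "question": "Which event happened first?",
         "options": ["A", "B"], "answer": "A"},
        {"video_id": "v1", "question": "Temporally reversed variant",
         "options": ["A", "B"], "answer": "B"},
    ]
    print(strict_accuracy(data, lambda vid, q, opts: opts[0]))  # 0.0
```

Under this grouping, a model that ignores temporal order and answers both variants identically cannot score on a video, which would explain why strict-ACC is a harder criterion than per-question accuracy.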
Related papers
- TimeCausality: Evaluating the Causal Ability in Time Dimension for Vision Language Models [13.018267909897014]
Reasoning about temporal causality, particularly irreversible transformations of objects governed by real-world knowledge, is a fundamental aspect of human visual understanding. We introduce TimeCausality, a novel benchmark designed to evaluate the causal reasoning ability of Vision-Language Models (VLMs) in the temporal dimension. We find that while the current SOTA open-source VLMs have achieved performance levels comparable to closed-source models like GPT-4o, they fall significantly behind on our benchmark compared with their closed-source competitors.
arXiv Detail & Related papers (2025-05-21T12:18:02Z)
- Towards Fine-Grained Video Question Answering [17.582244704442747]
This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. We present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding.
arXiv Detail & Related papers (2025-03-10T01:02:01Z)
- Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! [22.75945626401567]
We propose a challenging evaluation benchmark named TemporalVQA. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including models like GPT-4o and Gemini-1.5-Pro, reveal significant challenges.
arXiv Detail & Related papers (2025-01-18T06:41:48Z)
- TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logical questions. We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate 2k and 10k QA pairs for each category. We assess the VideoQA model's temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z)
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce the Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
The LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z)
- LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models [53.64461404882853]
Video quality assessment (VQA) algorithms are needed to monitor and optimize the quality of streaming videos.
Here, we propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel visual modeling strategy for quality-aware feature extraction.
arXiv Detail & Related papers (2024-08-26T04:29:52Z)
- Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)