Related papers: V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

URL: http://arxiv.org/abs/2503.11495v1
Date: Fri, 14 Mar 2025 15:21:44 GMT
Title: V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Authors: Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, Shaogang Gong,
Abstract summary: We introduce a Video S-Temporal Reasoning (V-STa) benchmark to address these shortcomings.<n>We construct a dataset to elicit the spatial-temporal reasoning process of Video-LLMs.<n>Experiments from 14 Video-LLMs reveal significant gaps between current Video-LLMs and the needs for robust and consistent consistent reasoning.
Score: 40.18308199837137
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Human processes video reasoning in a sequential spatio-temporal reasoning logic, we first identify the relevant frames ("when") and then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as biases in generating answers. In this work, we introduce a Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located while capturing the underlying Chain-of-thought (CoT) logic. To support this evaluation, we construct a dataset to elicit the spatial-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments from 14 Video-LLMs on our V-STaR reveal significant gaps between current Video-LLMs and the needs for robust and consistent spatio-temporal reasoning.

Related papers

TimeLogic: A Temporal Logic Benchmark for Video QA [64.32208175236323]
We introduce the TimeLogic QA (TLQA) framework to automatically generate temporal logical questions.<n>We leverage 4 datasets, STAR, Breakfast, AGQA, and CrossTask, and generate 2k and 10k QA pairs for each category.<n>We assess the VideoQA model's temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
arXiv Detail & Related papers (2025-01-13T11:12:59Z)
Neural-Symbolic VideoQA: Learning Compositional Spatio-Temporal Reasoning for Real-world Video Question Answering [0.9712140341805068]
We propose a neural-symbolic framework called Symbolic-world VideoQA (NSVideo-QA) for real-world VideoQA tasks. NSVideo-QA exhibits internal consistency in answering compositional questions and significantly improves the capability of logical inference for VideoQA tasks.
arXiv Detail & Related papers (2024-04-05T10:30:38Z)
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering [16.502197578954917]
graph-based methods for VideoQA usually ignore keywords in questions and employ a simple graph to aggregate features. We propose a Keyword-aware Relative Spatio-Temporal (KRST) graph network for VideoQA.
arXiv Detail & Related papers (2023-07-25T04:41:32Z)
Discovering Spatio-Temporal Rationales for Video Question Answering [68.33688981540998]
This paper strives to solve complex video question answering (VideoQA) which features long video containing multiple objects and events at different time. We propose a Spatio-Temporal Rationalization (STR) that adaptively collects question-critical moments and objects using cross-modal interaction. We also propose TranSTR, a Transformer-style neural network architecture that takes STR as the core and additionally underscores a novel answer interaction mechanism.
arXiv Detail & Related papers (2023-07-22T12:00:26Z)
Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query. Existing methods primarily model the matching relationship between query and moment by complex cross-modal interactions. We propose a causality-inspired VMR framework that builds structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
HySTER: A Hybrid Spatio-Temporal Event Reasoner [75.41988728376081]
We present the HySTER: a Hybrid Spatio-Temporal Event Reasoner to reason over physical events in videos. We define a method based on general temporal, causal and physics rules which can be transferred across tasks. This work sets the foundations for the incorporation of inductive logic programming in the field of VideoQA.
arXiv Detail & Related papers (2021-01-17T11:07:17Z)
Long Short-Term Relation Networks for Video Action Detection [155.13392337831166]
Long Short-Term Relation Networks (LSTR) are presented in this paper. LSTR aggregates and propagates relation to augment features for video action detection. Extensive experiments are conducted on four benchmark datasets.
arXiv Detail & Related papers (2020-03-31T10:02:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.