GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
- URL: http://arxiv.org/abs/2602.17555v2
- Date: Sat, 21 Feb 2026 08:46:12 GMT
- Title: GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
- Authors: Zixu Cheng, Da Li, Jian Hu, Yuhang Zang, Ziquan Liu, Shaogang Gong, Wei Li,
- Abstract summary: Video reasoning requires understanding the causal relationships between events in a video. Existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries. We propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs.
- Score: 45.90413025033315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. While existing multimodal large language models (MLLMs) often infer event relations through dense captions or video summaries for video reasoning, such modeling still lacks causal understanding. Without explicit causal structure modeling within and across video events, these models suffer from hallucinations during video reasoning. In this work, we propose GraphThinker, a reinforcement finetuning-based method that constructs structural event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and incorporate these formed scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it shows superior ability to capture object and event relations with more precise event localization, reducing hallucinations in video reasoning compared to prior methods.
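The abstract describes the event-based video scene graph (EVSG) only at a high level, so the sketch below is a hedged illustration of what such a structure could look like: events with time spans and intra-event object triplets, plus typed inter-event edges, serialized into the model's intermediate thinking step. All class and field names (Event, IntraEventRelation, InterEventRelation, EVSG, to_prompt) are assumptions for illustration, not the paper's actual interface.

```python
# Hypothetical sketch of an event-based video scene graph (EVSG).
# Names and fields are illustrative assumptions, not GraphThinker's code.
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntraEventRelation:
    """Object-level triplet inside a single event, e.g. (person, holds, cup)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Event:
    """One event segment with its time span and local scene-graph triplets."""
    event_id: int
    description: str
    start_sec: float
    end_sec: float
    relations: List[IntraEventRelation] = field(default_factory=list)


@dataclass
class InterEventRelation:
    """Typed link across events, e.g. 'causes', 'before', 'enables'."""
    src_event: int
    dst_event: int
    relation: str


@dataclass
class EVSG:
    """Event-based video scene graph: events plus inter-event edges."""
    events: List[Event]
    edges: List[InterEventRelation]

    def to_prompt(self) -> str:
        """Serialize the graph as text so it can be injected into an MLLM
        prompt as an intermediate thinking step (one possible encoding)."""
        lines = []
        for e in self.events:
            triplets = "; ".join(
                f"({r.subject}, {r.predicate}, {r.obj})" for r in e.relations
            )
            lines.append(
                f"[Event {e.event_id}] {e.start_sec:.1f}-{e.end_sec:.1f}s: "
                f"{e.description} | {triplets}"
            )
        for edge in self.edges:
            lines.append(
                f"Event {edge.src_event} --{edge.relation}--> Event {edge.dst_event}"
            )
        return "\n".join(lines)
```

Serializing the graph as plain text before answering is only one plausible way to realize the "intermediate thinking process" mentioned in the abstract; the paper may use a different encoding, and the visual attention reward is a separate RL signal not shown here.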
Related papers
- VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models [8.155587933125673]
Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos. We introduce a novel benchmark for evaluating the Video Event Relation Hallucination, named VERHallu.
arXiv Detail & Related papers (2026-01-15T02:40:41Z) - Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form videos due to context window limits. We introduce Video-EM, a training-free framework inspired by the principles of human memory. Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z) - TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action [28.930109403769166]
We propose TEMPURA, a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps.
arXiv Detail & Related papers (2025-05-02T21:00:17Z) - HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation [7.027942200231825]
Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. We propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. We introduce a new Video Scene Graph Reasoning dataset featuring 1.9M frames from third-person, egocentric, and drone views.
arXiv Detail & Related papers (2024-11-27T04:24:39Z) - EventHallusion: Diagnosing Event Hallucinations in Video LLMs [42.66453293963568]
Multimodal Large Language Models (MLLMs) have made significant progress in the video comprehension field. We propose EventHallusion, a novel benchmark that focuses on assessing the VideoLLMs' hallucination toward events. We also propose a simple yet effective method, called Temporal Contrastive Decoding (TCD), to tackle the hallucination problems of VideoLLMs.
arXiv Detail & Related papers (2024-09-25T03:49:46Z) - RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations whether across multiple images or within a video.
Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to endow RelationVLM with the capability of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
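The Hungarian matching mentioned above can be illustrated with a small sketch: assuming pairwise similarity scores between decomposed instruction clauses and candidate video events are already available, SciPy's linear_sum_assignment solves the one-to-one assignment. The score matrix and names below are made up for illustration and are not VidCoM's actual implementation.

```python
# Hypothetical illustration of matching decomposed instruction clauses to
# video events via the Hungarian algorithm (not VidCoM's actual code).
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: sub-instructions from decomposing a linguistic instruction.
# Cols: candidate video events. Entries: similarity scores (e.g. cosine
# similarity of embeddings) -- invented numbers for illustration.
similarity = np.array([
    [0.82, 0.10, 0.05],   # "the person picks up the cup"
    [0.15, 0.71, 0.20],   # "she pours water into it"
    [0.08, 0.25, 0.64],   # "then she drinks"
])

# linear_sum_assignment minimizes total cost, so negate similarities
# to obtain a maximum-similarity matching.
row_idx, col_idx = linear_sum_assignment(-similarity)

for r, c in zip(row_idx, col_idx):
    print(f"sub-instruction {r} -> video event {c} (score {similarity[r, c]:.2f})")
```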
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they can be subject to two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z) - Relational Graph Learning for Grounded Video Description Generation [85.27028390401136]
Grounded video description (GVD) encourages captioning models to attend to appropriate video regions dynamically and generate a description.
Such a setting can help explain the decisions of captioning models and prevent the model from hallucinating object words in its description.
We design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts.
arXiv Detail & Related papers (2021-12-02T03:48:45Z) - Temporal Relational Modeling with Self-Supervision for Action Segmentation [38.62057004624234]
We introduce Dilated Temporal Graph Reasoning Module (DTGRM) to model temporal relations in video.
In particular, we capture and model temporal relations via constructing multi-level dilated temporal graphs.
Our model outperforms state-of-the-art action segmentation models on three challenging datasets.
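The multi-level dilated temporal graphs described in this entry can be pictured with a short sketch: each level connects every frame to frames a fixed dilation apart, and dilations grow across levels so higher levels capture longer-range temporal relations. Function and variable names below are assumptions for illustration, not DTGRM's implementation.

```python
# Hypothetical sketch of multi-level dilated temporal graph construction
# (illustrative only; not DTGRM's actual code).
import numpy as np

def dilated_temporal_adjacency(num_frames: int, dilation: int) -> np.ndarray:
    """Adjacency matrix linking each frame t to frames t - dilation and
    t + dilation, i.e. one level of a dilated temporal graph."""
    adj = np.zeros((num_frames, num_frames), dtype=np.float32)
    for t in range(num_frames):
        for neighbor in (t - dilation, t + dilation):
            if 0 <= neighbor < num_frames:
                adj[t, neighbor] = 1.0
    return adj

# Stack levels with exponentially growing dilations; a GNN layer per level
# would then reason over progressively longer-range temporal relations.
num_frames = 16
levels = [dilated_temporal_adjacency(num_frames, d) for d in (1, 2, 4, 8)]
print([int(a.sum()) for a in levels])  # number of directed edges per level
```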
arXiv Detail & Related papers (2020-12-14T13:41:28Z) - Visual Relation Grounding in Videos [86.06874453626347]
We explore a novel task named visual Relation Grounding in Videos (RGV).
This task aims at providing supportive visual facts for other video-language tasks (e.g., video grounding and video question answering).
We tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph.
Experimental results demonstrate that our model can not only outperform baseline approaches significantly, but also produce visually meaningful facts.
arXiv Detail & Related papers (2020-07-17T08:20:39Z)