Related papers: GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering

URL: http://arxiv.org/abs/2511.20201v1
Date: Tue, 25 Nov 2025 11:24:25 GMT
Title: GHR-VQA: Graph-guided Hierarchical Relational Reasoning for Video Question Answering
Authors: Dionysia Danai Brilli, Dimitrios Mallis, Vassilis Pitsikalis, Petros Maragos,
Abstract summary: We propose a novel framework that incorporates graphs capture human-object interactions within video sequences.<n>Unlike traditional methods, each frame is represented as intricate and graph human nodes across frames are linked to a scene.<n>This human-rooted structure enhances interpretability by decomposing into human-object interactions.
Score: 15.887744981283179
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose GHR-VQA, Graph-guided Hierarchical Relational Reasoning for Video Question Answering (Video QA), a novel human-centric framework that incorporates scene graphs to capture intricate human-object interactions within video sequences. Unlike traditional pixel-based methods, each frame is represented as a scene graph and human nodes across frames are linked to a global root, forming the video-level graph and enabling cross-frame reasoning centered on human actors. The video-level graphs are then processed by Graph Neural Networks (GNNs), transforming them into rich, context-aware embeddings for efficient processing. Finally, these embeddings are integrated with question features in a hierarchical network operating across different abstraction levels, enhancing both local and global understanding of video content. This explicit human-rooted structure enhances interpretability by decomposing actions into human-object interactions and enables a more profound understanding of spatiotemporal dynamics. We validate our approach on the Action Genome Question Answering (AGQA) dataset, achieving significant performance improvements, including a 7.3% improvement in object-relation reasoning over the state of the art.

Related papers

Language-guided Recursive Spatiotemporal Graph Modeling for Video Summarization [47.65036144170475]
Video summarization aims to selects that are visually diverse and represent the whole story of a given video.<n>We present VideoGraph, which formulates the objects and frames as nodes of the spatial and temporal graphs.<n>In our experiments, VideoGraph achieves state-of-the-art performance on several benchmarks for generic and querylink video summarization.
arXiv Detail & Related papers (2025-09-06T05:37:31Z)
Understanding Long Videos via LLM-Powered Entity Relation Graphs [51.13422967711056]
GraphVideoAgent is a framework that maps and monitors the evolving relationships between visual entities throughout the video sequence.<n>Our approach demonstrates remarkable effectiveness when tested against industry benchmarks.
arXiv Detail & Related papers (2025-01-27T10:57:24Z)
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering [13.294004180200496]
We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to integrate multi-modal knowledge better.<n>LGQAVE moves beyond traditional ad-hoc frame sampling by utilizing a cross-attention mechanism that precisely identifies the most relevant frames concerning the questions.<n>An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers.
arXiv Detail & Related papers (2024-12-12T12:39:07Z)
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding [8.10024991952397]
Existing methods focus on complex interactivities while leveraging a simple relationship model. We propose a new approach named Hierarchical Interlacement Graph (HIG), which leverages a unified layer and graph within a hierarchical structure. Our approach demonstrates superior performance to other methods through extensive experiments conducted in various scenarios.
arXiv Detail & Related papers (2023-12-05T18:47:19Z)
Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In given video-based Human-Object Interaction scene, modeling thetemporal relationship between humans and objects are the important cue to understand the contextual information presented in the video. With the effective-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also directly capture inter-time dependencies. The full use of appearance features, spatial location and the semantic information are also the key to improve the video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
Location-aware Graph Convolutional Networks for Video Question Answering [85.44666165818484]
We propose to represent the contents in the video as a location-aware graph. Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action. Our method significantly outperforms state-of-the-art methods on TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.
arXiv Detail & Related papers (2020-08-07T02:12:56Z)
SumGraph: Video Summarization via Recursive Graph Modeling [59.01856443537622]
We propose graph modeling networks for video summarization, termed SumGraph, to represent a relation graph. We achieve state-of-the-art performance on several benchmarks for video summarization in both supervised and unsupervised manners.
arXiv Detail & Related papers (2020-07-17T08:11:30Z)
Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning [72.52804406378023]
Cross-modal retrieval between videos and texts has attracted growing attentions due to the rapid emergence of videos on the web. To improve fine-grained video-text retrieval, we propose a Hierarchical Graph Reasoning model, which decomposes video-text matching into global-to-local levels.
arXiv Detail & Related papers (2020-03-01T03:44:19Z)
Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS) AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges. Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.