HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
- URL: http://arxiv.org/abs/2411.18042v2
- Date: Mon, 31 Mar 2025 08:16:49 GMT
- Title: HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
- Authors: Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren, Alper Yilmaz, Khoa Luu
- Abstract summary: Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. We propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. We introduce a new Video Scene Graph Reasoning dataset featuring 1.9M frames from third-person, egocentric, and drone views.
- Score: 7.027942200231825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal LLMs have advanced vision-language tasks but still struggle with understanding video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To this end, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views and supporting five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across all five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
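To make the structure concrete, here is a minimal Python sketch of a unified scene hypergraph in the spirit the abstract describes: per-frame entity relations, plus hyperedges over sets of relations standing in for causal transitions, serialized to text for injection into an LLM prompt. All class, field, and method names are hypothetical, not from the paper.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Triplet:
    """A pairwise relation inside one frame's entity scene graph."""
    subject: str
    predicate: str
    obj: str

@dataclass
class SceneHyperGraph:
    """Entity scene graphs per frame, plus procedural hyperedges
    linking groups of relations across frames (causal transitions)."""
    frames: dict[int, set[Triplet]] = field(default_factory=dict)
    transitions: list[tuple[frozenset[Triplet], frozenset[Triplet]]] = field(default_factory=list)

    def add_relation(self, t: int, s: str, p: str, o: str) -> Triplet:
        tri = Triplet(s, p, o)
        self.frames.setdefault(t, set()).add(tri)
        return tri

    def add_transition(self, before: set[Triplet], after: set[Triplet]) -> None:
        # A hyperedge connects *sets* of relations, not single node pairs,
        # which is what lets the graph express multi-way interactions.
        self.transitions.append((frozenset(before), frozenset(after)))

    def to_prompt(self) -> str:
        """Serialize the hypergraph into text an LLM can consume
        (one plausible injection strategy, not the paper's exact one)."""
        lines = []
        for t in sorted(self.frames):
            rels = ", ".join(f"({x.subject} {x.predicate} {x.obj})"
                             for x in sorted(self.frames[t], key=str))
            lines.append(f"frame {t}: {rels}")
        for i, (b, a) in enumerate(self.transitions):
            lines.append(f"transition {i}: {len(b)} relation(s) -> {len(a)} relation(s)")
        return "\n".join(lines)

g = SceneHyperGraph()
r1 = g.add_relation(0, "person", "holds", "cup")
r2 = g.add_relation(1, "person", "drinks_from", "cup")
g.add_transition({r1}, {r2})
print(g.to_prompt())
```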
Related papers
- Language-Guided Graph Representation Learning for Video Summarization [96.2763459348758]
We propose a novel Language-guided Graph Representation Learning Network (LGRLN) for video summarization.
Specifically, we introduce a video graph generator that converts video frames into a structured graph to preserve temporal order and contextual dependencies.
Our method outperforms existing approaches across multiple benchmarks.
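As a rough illustration of what a video graph generator of this kind might produce, the sketch below builds order-preserving temporal edges plus similarity-based contextual edges; the construction and threshold are assumptions for the sketch, not LGRLN's actual design.

```python
from itertools import combinations

def build_frame_graph(features, sim_threshold=0.8):
    """Illustrative frame graph: temporal edges keep frame order,
    similarity edges approximate contextual dependencies.
    `features` maps frame index -> feature vector (list of floats)."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return dot / (na * nb) if na and nb else 0.0

    nodes = sorted(features)
    temporal = list(zip(nodes, nodes[1:]))  # order-preserving chain
    contextual = [(i, j) for i, j in combinations(nodes, 2)
                  if cosine(features[i], features[j]) >= sim_threshold]
    return {"nodes": nodes, "temporal": temporal, "contextual": contextual}

g = build_frame_graph({0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]})
print(g["temporal"])    # [(0, 1), (1, 2)]
print(g["contextual"])  # [(0, 1)] given the 0.8 threshold
```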
arXiv Detail & Related papers (2025-11-14T04:35:48Z)
- KeySG: Hierarchical Keyframe-Based 3D Scene Graphs [1.5134439544218246]
KeySG represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements.
We leverage a VLM to extract scene information, alleviating the need to explicitly model relationship edges between objects.
Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs.
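A minimal sketch of such a floor/room/object/element hierarchy and a query over it, assuming plain nesting stands in for the graph; all names are invented for illustration.

```python
# Hypothetical shape of a hierarchical 3D scene graph in the spirit of
# KeySG: nesting encodes the floor -> room -> object -> element levels.
scene = {
    "floor_1": {
        "kitchen": {
            "cabinet": ["handle", "hinge"],  # functional elements
            "stove": ["knob", "burner"],
        },
        "hallway": {"door": ["handle"]},
    },
}

def find(scene, query):
    """Walk the hierarchy and return every path whose node matches the query."""
    hits = []
    for floor, rooms in scene.items():
        for room, objects in rooms.items():
            for obj, elements in objects.items():
                for el in elements:
                    if query in (floor, room, obj, el):
                        hits.append((floor, room, obj, el))
    return hits

print(find(scene, "handle"))
# [('floor_1', 'kitchen', 'cabinet', 'handle'), ('floor_1', 'hallway', 'door', 'handle')]
```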
arXiv Detail & Related papers (2025-10-01T15:53:27Z)
- OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation [14.938566273427098]
Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips.
Existing SGA approaches leverage visual cues, often struggling to integrate valuable commonsense knowledge.
We propose a new approach to better understand the objects, concepts, and relationships in a scene graph.
arXiv Detail & Related papers (2025-09-06T09:35:15Z)
- LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study [12.90392791734461]
Large Language Models (LLMs) have paved the way for expanding applications in embodied AI, robotics, and other real-world tasks.
Recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene.
We introduce Text-Scene Graph (TSG) Bench, a benchmark designed to assess LLMs' ability to understand scene graphs.
arXiv Detail & Related papers (2025-05-26T04:45:12Z)
- RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph [3.1671311914949545]
RAVU is a framework for retrieval-augmented video understanding with compositional reasoning over a spatio-temporal graph.
We construct a graph representation of the video, capturing both spatial and temporal relationships between entities.
To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph.
Our approach enables more accurate understanding of long videos, particularly for queries that require multi-hop reasoning and tracking objects across frames.
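The step-wise execution idea can be illustrated on a toy spatio-temporal graph: decompose a question into sub-steps, answer each against the graph, and thread the intermediate result through. The event tuples and helper below are hypothetical, not RAVU's actual interface.

```python
# Toy spatio-temporal graph: (time, subject, relation, object) tuples.
events = [
    (1, "person", "opens", "fridge"),
    (3, "person", "picks_up", "bottle"),
    (5, "person", "closes", "fridge"),
]

def first_time(relation, obj=None, after=-1):
    """Earliest event matching the pattern that occurs after `after`."""
    for t, s, r, o in sorted(events):
        if t > after and r == relation and (obj is None or o == obj):
            return t, o
    return None, None

# "What did the person pick up after opening the fridge?"
t_open, _ = first_time("opens", "fridge")          # step 1: locate the opening
_, answer = first_time("picks_up", after=t_open)   # step 2: first pick-up after it
print(answer)  # bottle
```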
arXiv Detail & Related papers (2025-05-06T04:38:09Z)
- Scene Graph Disentanglement and Composition for Generalizable Complex Image Generation [44.457347230146404]
We leverage the scene graph, a powerful structured representation, for complex image generation.
We employ the generative capabilities of variational autoencoders and diffusion models in a generalizable manner.
Our method outperforms recent competitors based on text, layout, or scene graph.
arXiv Detail & Related papers (2024-10-01T07:02:46Z)
- Multi-object event graph representation learning for Video Question Answering [4.236280446793381]
We propose a contrastive language event graph representation learning method called CLanG to address this limitation.
Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R.
arXiv Detail & Related papers (2024-09-12T04:42:51Z)
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
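A sequence-generation view of SGG reduces, at decode time, to recovering triplets from generated text. The bracketed output format below is invented for the sketch; the framework's real tokenization will differ.

```python
import re

def parse_triplets(text):
    """Recover (subject, predicate, object) triplets from generated text.
    Open-vocabulary: any string may appear; there is no fixed label set."""
    triplets = []
    for m in re.finditer(r"\[([^\]]+)\]", text):
        parts = [p.strip() for p in m.group(1).split(",")]
        if len(parts) == 3:  # skip malformed spans
            triplets.append(tuple(parts))
    return triplets

generated = "[person, riding, skateboard] [skateboard, on, ramp]"
print(parse_triplets(generated))
# [('person', 'riding', 'skateboard'), ('skateboard', 'on', 'ramp')]
```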
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z)
- Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision.
We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to achieve better representational capabilities.
MGH achieves 90.0% top-1 accuracy on MARS, outperforming state-of-the-art schemes.
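The hypergraph idea here is that a single hyperedge can tie together many part-level nodes at once, at different granularities. A toy sketch, with invented node naming and granularities:

```python
# Nodes are (frame, part) pairs; hyperedges group many nodes at once.
nodes = [(t, part) for t in range(3) for part in ("head", "torso", "legs")]

hyperedges = {
    # coarse granularity: one hyperedge per frame (whole body)
    "frame_0": [n for n in nodes if n[0] == 0],
    # fine granularity: one hyperedge per part, tracked across frames
    "legs_track": [n for n in nodes if n[1] == "legs"],
}
for name, members in hyperedges.items():
    print(name, members)
```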
arXiv Detail & Related papers (2021-04-30T11:20:02Z)
- Temporal Relational Modeling with Self-Supervision for Action Segmentation [38.62057004624234]
We introduce a Dilated Temporal Graph Reasoning Module (DTGRM) to model temporal relations in video.
In particular, we capture and model temporal relations by constructing multi-level dilated temporal graphs.
Our model outperforms state-of-the-art action segmentation models on three challenging datasets.
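A dilated temporal graph at level d simply links each frame t to frame t + d, so stacking levels mixes short- and long-range context. A minimal sketch (the dilation set and edge form are illustrative, not DTGRM's exact ones):

```python
def dilated_temporal_edges(num_frames, dilations=(1, 2, 4)):
    """Multi-level dilated temporal graphs as plain edge lists:
    level d connects every frame t to frame t + d."""
    return {d: [(t, t + d) for t in range(num_frames - d)] for d in dilations}

edges = dilated_temporal_edges(6)
print(edges[1])  # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(edges[4])  # [(0, 4), (1, 5)]
```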
arXiv Detail & Related papers (2020-12-14T13:41:28Z)
- VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
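As a much-simplified stand-in for the matching idea (not VLG-Net's actual formulation), one can score candidate video snippets against query tokens and return the best-matching interval:

```python
def ground(query_tokens, snippets, sim):
    """Pick the snippet whose summed token similarities are highest.
    `sim(token, snippet) -> float`; snippets are (start, end, label)."""
    def score(snippet):
        return sum(sim(tok, snippet) for tok in query_tokens)
    return max(snippets, key=score)

snippets = [(0, 5, "cooking"), (5, 9, "eating"), (9, 12, "washing")]
sim = lambda tok, sn: 1.0 if tok in sn[2] else 0.0  # toy similarity
print(ground(["eat"], snippets, sim))  # (5, 9, 'eating')
```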
arXiv Detail & Related papers (2020-11-19T22:32:03Z)
- Compositional Video Synthesis with Action Graphs [112.94651460161992]
Videos of actions are complex signals containing rich compositional structure in space and time.
We propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task.
Our generative model for this task (AG2Vid) disentangles motion and appearance features and, by incorporating a scheduling mechanism for actions, facilitates timely and coordinated video generation.
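One plausible encoding of an action graph, with actions as timed edges so a scheduler can ask what is active at each frame; the node and action names are invented for the sketch:

```python
# Objects are nodes; actions are edges annotated with (start, end)
# so a generator can schedule them over the clip.
action_graph = {
    "nodes": ["hand", "cup", "table"],
    "actions": [
        ("hand", "cup", "grasp", 0, 10),
        ("hand", "cup", "lift", 10, 25),
        ("cup", "table", "place_on", 25, 40),
    ],
}

def schedule(graph, t):
    """Actions active at timestep t, the query a scheduling mechanism
    needs when synthesizing the frame for time t."""
    return [(s, o, a) for s, o, a, t0, t1 in graph["actions"] if t0 <= t < t1]

print(schedule(action_graph, 12))  # [('hand', 'cup', 'lift')]
```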
arXiv Detail & Related papers (2020-06-27T09:39:04Z)
- Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
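A minimal sketch of the graph construction AGNN describes, with frames as nodes and all pairs as edges, plus one toy message-passing round standing in for the learned attentive messages:

```python
from itertools import combinations

def fully_connected_frame_graph(num_frames):
    """Frames as nodes, every frame pair as an edge; in practice each
    edge would carry a learned relation feature."""
    nodes = list(range(num_frames))
    edges = list(combinations(nodes, 2))
    return nodes, edges

def message_passing_round(node_feats, edges):
    """One toy round: each node averages its neighbors' scalar features
    (illustrative only, not AGNN's attentive update)."""
    neighbors = {n: [] for n in range(len(node_feats))}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return [sum(node_feats[m] for m in neighbors[n]) / len(neighbors[n])
            for n in neighbors]

nodes, edges = fully_connected_frame_graph(4)
print(len(edges))                                     # 6 = C(4, 2)
print(message_passing_round([1.0, 2.0, 3.0, 4.0], edges))
# each node gets the mean of the other three features
```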
arXiv Detail & Related papers (2020-01-19T10:45:27Z)