SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
- URL: http://arxiv.org/abs/2412.11026v1
- Date: Sun, 15 Dec 2024 02:41:31 GMT
- Title: SceneLLM: Implicit Language Reasoning in LLM for Dynamic Scene Graph Generation
- Authors: Hang Zhang, Zhuoling Li, Jun Liu
- Abstract summary: SceneLLM is a framework that transforms video frames into linguistic signals (scene tokens).
Our method achieves state-of-the-art results on the Action Genome (AG) benchmark.
Extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.
- Score: 8.768484848591168
- Abstract: Dynamic scenes contain intricate spatio-temporal information, crucial for mobile robots, UAVs, and autonomous driving systems to make informed decisions. Parsing these scenes into semantic triplets <Subject-Predicate-Object> for accurate Scene Graph Generation (SGG) is highly challenging due to the fluctuating spatio-temporal complexity. Inspired by the reasoning capabilities of Large Language Models (LLMs), we propose SceneLLM, a novel framework that leverages LLMs as powerful scene analyzers for dynamic SGG. Our framework introduces a Video-to-Language (V2L) mapping module that transforms video frames into linguistic signals (scene tokens), making the input more comprehensible for LLMs. To better encode spatial information, we devise a Spatial Information Aggregation (SIA) scheme, inspired by the structure of Chinese characters, which encodes spatial data into tokens. Using Optimal Transport (OT), we generate an implicit language signal from the frame-level token sequence that captures the video's spatio-temporal information. To further improve the LLM's ability to process this implicit linguistic input, we apply Low-Rank Adaptation (LoRA) to fine-tune the model. Finally, we use a transformer-based SGG predictor to decode the LLM's reasoning and predict semantic triplets. Our method achieves state-of-the-art results on the Action Genome (AG) benchmark, and extensive experiments show the effectiveness of SceneLLM in understanding and generating accurate dynamic scene graphs.
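The abstract describes a multi-stage pipeline: V2L mapping of frames to scene tokens, SIA spatial encoding, Optimal Transport aggregation, a LoRA-tuned LLM, and a transformer-based SGG predictor. The following is a minimal PyTorch sketch of how such a pipeline could be wired together; it is not the paper's implementation. All module choices, dimensions, and the use of a small TransformerEncoder as a stand-in for the LLM are illustrative assumptions, and the SIA scheme and OT aggregation are omitted for brevity.

```python
# Hedged sketch of a SceneLLM-style pipeline (assumed structure, not the authors' code).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank (LoRA) update."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)          # freeze the "pretrained" weight
        self.lora_a = nn.Linear(dim, rank, bias=False)  # low-rank down-projection
        self.lora_b = nn.Linear(rank, dim, bias=False)  # low-rank up-projection
        nn.init.zeros_(self.lora_b.weight)              # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x))


class SceneLLMSketch(nn.Module):
    def __init__(self, feat_dim=512, llm_dim=768, num_predicates=26):
        super().__init__()
        # Video-to-Language (V2L) mapping: per-frame visual features -> scene tokens.
        self.v2l = nn.Linear(feat_dim, llm_dim)
        # LoRA adapter in front of a small Transformer standing in for the LLM backbone.
        self.adapter = LoRALinear(llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Transformer-based SGG predictor, reduced here to a linear head for brevity.
        self.sgg_head = nn.Linear(llm_dim, num_predicates)

    def forward(self, frame_feats):              # frame_feats: (B, T, feat_dim)
        scene_tokens = self.v2l(frame_feats)     # implicit linguistic signal
        hidden = self.llm(self.adapter(scene_tokens))
        return self.sgg_head(hidden)             # per-frame predicate logits


logits = SceneLLMSketch()(torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 16, 26])
```

Freezing the base weights and training only the low-rank adapters mirrors the LoRA fine-tuning mentioned in the abstract, keeping the number of trainable parameters small while the backbone stays fixed.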
Related papers
- SocialGPT: Prompting LLMs for Social Relation Reasoning via Greedy Segment Optimization [70.11167263638562]
Social relation reasoning aims to identify relation categories such as friends, spouses, and colleagues from images.
We first present a simple yet well-crafted framework named SocialGPT, which combines the perception capability of Vision Foundation Models (VFMs) and the reasoning capability of Large Language Models (LLMs) within a modular framework.
arXiv Detail & Related papers (2024-10-28T18:10:26Z) - An Efficient Sign Language Translation Using Spatial Configuration and Motion Dynamics with LLMs [7.630967411418269]
Gloss-free Sign Language Translation (SLT) converts sign videos directly into spoken language sentences without relying on glosses.
This paper emphasizes the importance of capturing the spatial configurations and motion dynamics inherent in sign language.
We introduce Spatial and Motion-based Sign Language Translation (SpaMo), a novel LLM-based SLT framework.
arXiv Detail & Related papers (2024-08-20T07:10:40Z) - Aligning Actions and Walking to LLM-Generated Textual Descriptions [3.1049440318608568]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various domains.
This work explores the use of LLMs to generate rich textual descriptions for motion sequences, encompassing both actions and walking patterns.
arXiv Detail & Related papers (2024-04-18T13:56:03Z) - ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z) - Large Language Model with Graph Convolution for Recommendation [21.145230388035277]
Text information can sometimes be of low quality, hindering its effectiveness for real-world applications.
With the knowledge and reasoning capabilities encapsulated in Large Language Models, utilizing LLMs emerges as a promising way to improve descriptions.
We propose a Graph-aware Convolutional LLM method to elicit LLMs to capture high-order relations in the user-item graph.
arXiv Detail & Related papers (2024-02-14T00:04:33Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z) - LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision [44.13777026011408]
We learn semantic properties that capture rich spatial and temporal information in video data by leveraging high-level logic specifications.
We evaluate our method on three datasets with rich spatial representations and temporal specifications: 20BN-Something-Something, MUGEN, and OpenPVSG.
arXiv Detail & Related papers (2023-04-15T22:24:05Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the accuracy of the listed information and is not responsible for any consequences of its use.