DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes
- URL: http://arxiv.org/abs/2505.03581v1
- Date: Tue, 06 May 2025 14:41:42 GMT
- Title: DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes
- Authors: Sergey Linok, Vadim Semenov, Anastasia Trunova, Oleg Bulichev, Dmitry Yudin
- Abstract summary: We introduce DyGEnc - a novel method for Encoding a Dynamic Graph. The method integrates a compressed spatial-temporal structural observation representation with the cognitive capabilities of large language models. DyGEnc outperforms existing visual methods by a large margin of 15-25% in addressing queries regarding the history of human-to-object interactions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The analysis of events in dynamic environments poses a fundamental challenge in the development of intelligent agents and robots capable of interacting with humans. Current approaches predominantly utilize visual models. However, these methods often capture information implicitly from images, lacking interpretable spatial-temporal object representations. To address this issue we introduce DyGEnc - a novel method for Encoding a Dynamic Graph. This method integrates compressed spatial-temporal structural observation representation with the cognitive capabilities of large language models. The purpose of this integration is to enable advanced question answering based on a sequence of textual scene graphs. Extended evaluations on the STAR and AGQA datasets indicate that DyGEnc outperforms existing visual methods by a large margin of 15-25% in addressing queries regarding the history of human-to-object interactions. Furthermore, the proposed method can be seamlessly extended to process raw input images utilizing foundational models for extracting explicit textual scene graphs, as substantiated by the results of a robotic experiment conducted with a wheeled manipulator platform. We hope that these findings will contribute to the implementation of robust and compressed graph-based robotic memory for long-horizon reasoning. Code is available at github.com/linukc/DyGEnc.
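To make the integration concrete, below is a minimal Python sketch of question answering over a sequence of textual scene graphs. It is an illustrative assumption, not the authors' implementation: DyGEnc encodes the graph sequence with a learned spatial-temporal encoder, whereas this sketch simply serializes (subject, relation, object) triplets into an LLM prompt; the `SceneGraph` class, `build_prompt` helper, and example history are hypothetical.

```python
# Minimal sketch: QA over a sequence of textual scene graphs by serializing
# per-timestep (subject, relation, object) triplets into an LLM prompt.
# Hypothetical names; DyGEnc itself uses a learned graph encoder rather than
# plain text concatenation.
from dataclasses import dataclass


@dataclass
class SceneGraph:
    """One timestep of observation as (subject, relation, object) triplets."""
    triplets: list[tuple[str, str, str]]

    def to_text(self) -> str:
        return "; ".join(f"{s} {r} {o}" for s, r, o in self.triplets)


def build_prompt(history: list[SceneGraph], question: str) -> str:
    """Serialize the spatial-temporal history into a single textual context."""
    lines = [f"t={t}: {g.to_text()}" for t, g in enumerate(history)]
    return ("Scene graph history:\n" + "\n".join(lines)
            + f"\nQuestion: {question}\nAnswer:")


# Example: a short history of human-to-object interactions.
history = [
    SceneGraph([("person", "walks to", "table")]),
    SceneGraph([("person", "picks up", "cup"), ("cup", "on", "table")]),
    SceneGraph([("person", "holds", "cup")]),
]
print(build_prompt(history, "What did the person pick up?"))
# The resulting prompt can be passed to any LLM chat/completions API.
```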
Related papers
- Hi-Dyna Graph: Hierarchical Dynamic Scene Graph for Robotic Autonomy in Human-Centric Environments [41.80879866951797]
Hi-Dyna Graph is a hierarchical dynamic scene graph architecture that integrates persistent global layouts with localized dynamic semantics for embodied robotic autonomy. An agent powered by large language models (LLMs) is employed to interpret the unified graph, infer latent task triggers, and generate executable instructions grounded in robotic affordances.
arXiv Detail & Related papers (2025-05-30T03:35:29Z)
- One-shot Video Imitation via Parameterized Symbolic Abstraction Graphs [8.872100864022675]
We propose to interpret video demonstrations through Parameterized Symbolic Abstraction Graphs (PSAG).
We further ground geometric constraints through simulation to estimate non-geometric, visually imperceptible attributes.
Our approach has been validated across a range of tasks, such as Cutting Avocado, Cutting Vegetable, Pouring Liquid, Rolling Dough, and Slicing Pizza.
arXiv Detail & Related papers (2024-08-22T18:26:47Z)
- SG-Bot: Object Rearrangement via Coarse-to-Fine Robotic Imagination on Scene Graphs [81.15889805560333]
We present SG-Bot, a novel rearrangement framework.
SG-Bot exemplifies lightweight, real-time, and user-controllable characteristics.
Experimental results demonstrate that SG-Bot outperforms competitors by a large margin.
arXiv Detail & Related papers (2023-09-21T15:54:33Z)
- Local-Global Information Interaction Debiasing for Dynamic Scene Graph Generation [51.92419880088668]
We propose a novel DynSGG model based on multi-task learning, DynSGG-MTL, which introduces local interaction information and global human-action interaction information.
Long-temporal human actions supervise the model to generate multiple scene graphs that conform to global constraints and prevent it from failing to learn tail predicates.
arXiv Detail & Related papers (2023-08-10T01:24:25Z)
- Modeling Dynamic Environments with Scene Graph Memory [46.587536843634055]
We present a new type of link prediction problem: link prediction on partially observable dynamic graphs.
Our graph is a representation of a scene in which rooms and objects are nodes, and their relationships are encoded in the edges.
We propose a novel state representation -- Scene Graph Memory (SGM) -- which captures the agent's accumulated set of observations.
We evaluate our method in the Dynamic House Simulator, a new benchmark that creates diverse dynamic graphs following the semantic patterns typically seen at homes.
arXiv Detail & Related papers (2023-05-27T17:39:38Z)
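As a rough illustration of the Scene Graph Memory idea in the entry above, here is a minimal Python sketch of a memory that accumulates partial observations of a dynamic environment. The class name, fields, and query helper are assumptions; the paper additionally pairs such a memory with a learned link predictor for relations the agent has not observed.

```python
# Minimal sketch (assumed interface, not the paper's code): a scene graph
# memory that accumulates partial observations with last-seen timestamps.
class SceneGraphMemory:
    """Accumulates (node, relation, node) observations over time."""

    def __init__(self) -> None:
        self.nodes: set[str] = set()
        self.edges: dict[tuple[str, str, str], int] = {}  # triplet -> last step

    def observe(self, step: int, triplets: list[tuple[str, str, str]]) -> None:
        # A partial observation only updates what the agent actually saw.
        for s, r, o in triplets:
            self.nodes.update((s, o))
            self.edges[(s, r, o)] = step

    def relations_of(self, obj: str) -> list[tuple[str, str, str]]:
        # Known relations involving `obj`; stale edges may be wrong in a
        # dynamic scene, which is what link prediction must correct.
        return [e for e in self.edges if obj in (e[0], e[2])]


memory = SceneGraphMemory()
memory.observe(0, [("mug", "on", "kitchen_counter")])
memory.observe(5, [("mug", "inside", "dishwasher")])
print(memory.relations_of("mug"))
```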
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- OG-SGG: Ontology-Guided Scene Graph Generation. A Case Study in Transfer Learning for Telepresence Robotics [124.08684545010664]
Scene graph generation from images is a task of great interest to applications such as robotics.
We propose an initial approximation to a framework called Ontology-Guided Scene Graph Generation (OG-SGG).
arXiv Detail & Related papers (2022-02-21T13:23:15Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress on the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments show that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
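As a generic illustration of the local-then-global refinement described in the ORD entry above, here is a small numpy sketch of two-stage message passing. The update rules, shapes, and attention-like weighting are assumptions for exposition, not the paper's HierGCN; it only shows the two-stage idea: aggregate each node's neighbourhood locally, then refine all pairwise connections globally.

```python
# Generic two-stage message-passing sketch (illustrative, not HierGCN).
import numpy as np


def local_aggregate(x: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Average each node's features with its observed neighbours."""
    deg = adj.sum(axis=1, keepdims=True) + 1.0  # +1 accounts for the self-loop
    return (x + adj @ x) / deg


def global_refine(x: np.ndarray) -> np.ndarray:
    """Re-weight object-object connections over all node pairs (attention-like)."""
    scores = x @ x.T  # pairwise affinities
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ x


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))        # 4 object nodes, 8-dim features
adj = np.array([[0, 1, 0, 0],      # sparse observed relations
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)

h = local_aggregate(x, adj)        # stage 1: local neighbourhood context
out = global_refine(h)             # stage 2: global connection refinement
print(out.shape)                   # (4, 8)
```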