MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
- URL: http://arxiv.org/abs/2512.16909v1
- Date: Thu, 18 Dec 2025 18:59:03 GMT
- Title: MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning
- Authors: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath,
- Abstract summary: Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. We introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements.
- Score: 44.61781303455069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
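The abstract describes a scene representation that unifies spatial-functional relations, part-level interactive elements, and object states that change over time, plus a Graph-then-Plan pipeline that predicts a task-oriented graph and plans over it. Below is a minimal Python sketch of such a structure under those stated assumptions; all class, field, and function names are illustrative, not the authors' API, and the planner is a placeholder where a real system would query a model such as MomaGraph-R1.

```python
# Hedged sketch of a state-aware, unified scene graph as described in the
# abstract: objects with mutable states, part-level interactive elements,
# and spatial-functional relations, plus a toy Graph-then-Plan loop.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InteractiveElement:
    name: str    # e.g. "handle", "knob"
    action: str  # e.g. "pull", "rotate"


@dataclass
class ObjectNode:
    name: str
    state: str                                              # e.g. "closed", "on"
    parts: List[InteractiveElement] = field(default_factory=list)


@dataclass
class Relation:
    subject: str
    predicate: str  # spatial ("inside") or functional ("opens into")
    target: str


class SceneGraph:
    """Task-oriented scene graph, updated as the agent observes and acts."""

    def __init__(self) -> None:
        self.nodes: Dict[str, ObjectNode] = {}
        self.edges: List[Relation] = []

    def add_object(self, node: ObjectNode) -> None:
        self.nodes[node.name] = node

    def relate(self, subject: str, predicate: str, target: str) -> None:
        self.edges.append(Relation(subject, predicate, target))

    def update_state(self, name: str, new_state: str) -> None:
        # Temporal update: object states change as manipulation proceeds.
        self.nodes[name].state = new_state


def graph_then_plan(graph: SceneGraph, task: str) -> List[str]:
    """Placeholder planner: a real system would condition a VLM on the task
    and the predicted graph; here we just surface actionable parts."""
    steps = []
    for node in graph.nodes.values():
        for part in node.parts:
            steps.append(f"{part.action} the {node.name} {part.name}")
    return steps


if __name__ == "__main__":
    g = SceneGraph()
    g.add_object(ObjectNode("cabinet", state="closed",
                            parts=[InteractiveElement("handle", "pull")]))
    g.add_object(ObjectNode("mug", state="empty"))
    g.relate("mug", "inside", "cabinet")
    print(graph_then_plan(g, "fetch the mug"))  # ['pull the cabinet handle']
    g.update_state("cabinet", "open")           # reflect the executed action
```

The separation into nodes with states, part-level elements, and typed edges mirrors the abstract's claim that spatial and functional relations, object states, and actionable parts belong in one representation rather than in separate, static snapshots.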
Related papers
- LookPlanGraph: Embodied Instruction Following Method with VLM Graph Augmentation [47.99822253865053]
Methods that use Large Language Models (LLMs) as planners for embodied instruction following tasks have become widespread. One solution is to use a scene graph that contains all the necessary information. Modern methods rely on prebuilt scene graphs and assume that all task-relevant information is available at the start of planning. We propose LookPlanGraph, a method that leverages a scene graph composed of static assets and object priors.
arXiv Detail & Related papers (2025-12-24T15:36:21Z) - Synthetic Visual Genome [88.00433979509218]
We introduce ROBIN: an instruction-tuned model trained with densely annotated relationships, capable of constructing high-quality dense scene graphs at scale. In total, our dataset contains 146K images and 5.6M relationships for 2.6M objects. The ROBIN-3B model, despite being trained on fewer than 3 million instances, outperforms similar-size models trained on over 300 million instances on relationship understanding benchmarks.
arXiv Detail & Related papers (2025-06-09T11:09:10Z) - LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study [12.90392791734461]
Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. Recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. We introduce Text-Scene Graph (TSG) Bench, a benchmark designed to assess LLMs' ability to understand scene graphs.
arXiv Detail & Related papers (2025-05-26T04:45:12Z) - Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation [0.7851536646859476]
We introduce the task of Efficient Scene Graph Generation (SGG) that prioritizes the generation of relevant relations.
We present a new dataset, VG150-curated, based on the annotations of the popular Visual Genome dataset.
We show through a set of experiments that this dataset contains more high-quality and diverse annotations than the one usually used in SGG.
arXiv Detail & Related papers (2023-05-30T00:55:49Z) - Unsupervised Task Graph Generation from Instructional Video Transcripts [53.54435048879365]
We consider a setting where text transcripts of instructional videos performing a real-world activity are provided.
The goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps.
We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components.
arXiv Detail & Related papers (2023-02-17T22:50:08Z) - Scene Graph Modification as Incremental Structure Expanding [61.84291817776118]
We focus on scene graph modification (SGM), where the system is required to learn how to update an existing scene graph based on a natural language query.
We frame SGM as a graph expansion task by introducing incremental structure expanding (ISE).
We construct a challenging dataset that contains more complicated queries and larger scene graphs than existing datasets.
arXiv Detail & Related papers (2022-09-15T16:26:14Z) - Segmentation-grounded Scene Graph Generation [47.34166260639392]
We propose a framework for pixel-level segmentation-grounded scene graph generation.
Our framework is agnostic to the underlying scene graph generation method.
It is learned in a multi-task manner with both target and auxiliary datasets.
arXiv Detail & Related papers (2021-04-29T08:54:08Z) - Visual Distant Supervision for Scene Graph Generation [66.10579690929623]
Scene graph models usually require supervised learning on large quantities of labeled data with intensive human annotation.
We propose visual distant supervision, a novel paradigm of visual relation learning, which can train scene graph models without any human-labeled data.
Comprehensive experimental results show that our distantly supervised model outperforms strong weakly supervised and semi-supervised baselines.
arXiv Detail & Related papers (2021-03-29T06:35:24Z)