MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
- URL: http://arxiv.org/abs/2509.22281v1
- Date: Fri, 26 Sep 2025 12:46:00 GMT
- Title: MesaTask: Towards Task-Driven Tabletop Scene Generation via 3D Spatial Reasoning
- Authors: Jinkun Hao, Naifu Liang, Zhen Luo, Xudong Xu, Weipeng Zhong, Ran Yi, Yichen Jin, Zhaoyang Lyu, Feng Zheng, Lizhuang Ma, Jiangmiao Pang
- Abstract summary: The ability of robots to execute manipulation tasks requires the availability of task-relevant tabletop scenes for training. Traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts. We present MesaTask, an LLM-based framework that utilizes a Spatial Reasoning Chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes.
- Score: 97.97174328960807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability of robots to interpret human instructions and execute manipulation tasks necessitates the availability of task-relevant tabletop scenes for training. However, traditional methods for creating these scenes rely on time-consuming manual layout design or purely randomized layouts, which are limited in terms of plausibility or alignment with the tasks. In this paper, we formulate a novel task, namely task-oriented tabletop scene generation, which poses significant challenges due to the substantial gap between high-level task instructions and the tabletop scenes. To support research on such a challenging task, we introduce MesaTask-10K, a large-scale dataset comprising approximately 10,700 synthetic tabletop scenes with manually crafted layouts that ensure realistic layouts and intricate inter-object relations. To bridge the gap between tasks and scenes, we propose a Spatial Reasoning Chain that decomposes the generation process into object inference, spatial interrelation reasoning, and scene graph construction for the final 3D layout. We present MesaTask, an LLM-based framework that utilizes this reasoning chain and is further enhanced with DPO algorithms to generate physically plausible tabletop scenes that align well with given task descriptions. Exhaustive experiments demonstrate the superior performance of MesaTask compared to baselines in generating task-conforming tabletop scenes with realistic layouts. Project page is at https://mesatask.github.io/
Related papers
- RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks [21.341051218915535]
We propose a Demonstration Decomposer that automatically decomposes demonstrations into sub-tasks. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks.
arXiv Detail & Related papers (2025-10-16T17:59:37Z)
- ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis [15.68979922374718]
ASHiTA is a framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks.
arXiv Detail & Related papers (2025-04-09T03:22:52Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning with physical scene constraint.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected in different achievable locations.
Experimental results show that the generated plan from our TaPA framework can achieve higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
- Robot Task Planning Based on Large Language Model Representing Knowledge with Directed Graph Structures [2.3698227130544547]
We propose a task planning method that combines human expertise with an LLM and have designed an LLM prompt template, Think_Net_Prompt.
We further propose a method to progressively decompose tasks and generate a task tree to reduce the planning volume for each task.
arXiv Detail & Related papers (2023-06-08T13:10:00Z)
- Unsupervised Task Graph Generation from Instructional Video Transcripts [53.54435048879365]
We consider a setting where text transcripts of instructional videos performing a real-world activity are provided.
The goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps.
We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components.
arXiv Detail & Related papers (2023-02-17T22:50:08Z)
- Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z)
- Sequential Manipulation Planning on Scene Graph [90.28117916077073]
We devise a 3D scene graph representation, contact graph+ (cg+), for efficient sequential task planning.
Goal configurations, naturally specified on contact graphs, can be produced by a genetic algorithm with an optimization method.
A task plan is then derived by computing the Graph Editing Distance (GED) between the initial contact graphs and the goal configurations, which generates graph edit operations corresponding to possible robot actions.
arXiv Detail & Related papers (2022-07-10T02:01:33Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- TO-Scene: A Large-scale Dataset for Understanding 3D Tabletop Scenes [24.422147844863304]
We introduce TO-Scene, a large-scale dataset focusing on tabletop scenes.
To acquire the data, a crowdsourcing UI is developed to transfer CAD objects onto tables from ScanNet.
A tabletop-aware learning strategy is proposed for better perceiving the small-sized tabletop instances.
arXiv Detail & Related papers (2022-03-17T17:00:55Z)