Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation
- URL: http://arxiv.org/abs/2512.18987v1
- Date: Mon, 22 Dec 2025 02:55:25 GMT
- Title: Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation
- Authors: Ryosuke Korekata, Quanting Xie, Yonatan Bisk, Komei Sugiura
- Abstract summary: Affordance RAG is a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. Our method outperformed existing approaches in retrieval performance for mobile manipulation instructions in large-scale indoor environments.
- Score: 20.373596661083152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this study, we address the problem of open-vocabulary mobile manipulation, where a robot is required to carry a wide range of objects to receptacles based on free-form natural language instructions. This task is challenging, as it involves understanding both visual semantics and the affordance of manipulation actions. To tackle these challenges, we propose Affordance RAG, a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. The model retrieves candidate targets based on regional and visual semantics and reranks them with affordance scores, allowing the robot to identify manipulation options that are likely to be executable in real-world environments. Our method outperformed existing approaches in retrieval performance for mobile manipulation instructions in large-scale indoor environments. Furthermore, in real-world experiments where the robot performed mobile manipulation in indoor environments based on free-form instructions, the proposed method achieved a task success rate of 85%, outperforming existing methods in both retrieval performance and overall task success.
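The abstract outlines a two-stage pipeline: candidates are first retrieved from the embodied memory by regional and visual semantics, then reranked by affordance scores. Below is a minimal sketch of that retrieve-then-rerank pattern; every name in it (MemoryEntry, retrieve_candidates, the w_region/w_visual weights, the affordance_score stub) is a hypothetical illustration of the idea, not the authors' implementation.

```python
# Minimal sketch of a retrieve-then-affordance-rerank pipeline in the
# spirit of the abstract above. All names and weights are hypothetical
# stand-ins, not the authors' code.
from dataclasses import dataclass
import numpy as np

@dataclass
class MemoryEntry:
    image_id: str
    region_emb: np.ndarray   # embedding of the region label, e.g. "kitchen"
    visual_emb: np.ndarray   # visual-semantic embedding of the image

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_candidates(instr_emb: np.ndarray, memory: list[MemoryEntry],
                        k: int = 10, w_region: float = 0.3,
                        w_visual: float = 0.7) -> list[MemoryEntry]:
    """Stage 1: rank memory entries by a weighted mix of regional and
    visual semantic similarity to the instruction; keep the top k."""
    scored = [(w_region * cosine(instr_emb, e.region_emb)
               + w_visual * cosine(instr_emb, e.visual_emb), e)
              for e in memory]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [e for _, e in scored[:k]]

def affordance_score(entry: MemoryEntry, instruction: str) -> float:
    """Stage 2 (stub): estimate how likely the manipulation implied by
    the instruction is executable at this image. A real system would
    query a vision-language model here; the constant is a placeholder."""
    return 0.5

def rerank_by_affordance(candidates: list[MemoryEntry],
                         instruction: str) -> list[MemoryEntry]:
    return sorted(candidates,
                  key=lambda e: affordance_score(e, instruction),
                  reverse=True)
```

In a full system the stubbed affordance_score would be replaced by a query to a vision-language model judging whether the pictured grasp or placement is physically executable, which is the reranking step the abstract credits for identifying manipulation options likely to succeed in the real world.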
Related papers
- EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation [16.468655011980843]
This paper seeks to harness the capabilities of diffusion models within a visuomotor policy framework to generate precise robotic trajectories. By employing reference demonstrations during training, the model learns to execute manipulation tasks specified through textual commands within the robot's immediate environment.
arXiv Detail & Related papers (2025-11-17T12:47:18Z) - Exploring Conditions for Diffusion models in Robotic Control [70.27711404291573]
We explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control. We find that naively applying textual conditions yields minimal or even negative gains in control tasks. We propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details.
arXiv Detail & Related papers (2025-10-17T10:24:14Z) - Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration [58.4036440289082]
Hand-object motion-capture (MoCap) offers large-scale, contact-rich demonstrations and holds promise for dexterous robotic manipulation. We introduce Dexplore, a unified single-loop optimization that jointly performs retargeting and tracking to learn robot control policies directly from MoCap at scale.
arXiv Detail & Related papers (2025-09-11T17:59:07Z) - MORE: Mobile Manipulation Rearrangement Through Grounded Language Reasoning [13.535721260188694]
MORE is a novel approach for enhancing the capabilities of language models to solve zero-shot mobile manipulation planning tasks. We evaluate MORE on 81 diverse rearrangement tasks from the BEHAVIOR-1K benchmark, where it becomes the first approach to successfully solve a significant share of the benchmark.
arXiv Detail & Related papers (2025-05-05T21:26:03Z) - Subtask-Aware Visual Reward Learning from Segmented Demonstrations [97.80917991633248]
This paper introduces REDS: REward learning from Demonstration with Segmentations, a novel reward learning framework. We train a dense reward function conditioned on video segments and their corresponding subtasks to ensure alignment with ground-truth reward signals. Our experiments show that REDS significantly outperforms baseline methods on complex robotic manipulation tasks in Meta-World.
arXiv Detail & Related papers (2025-02-28T01:25:37Z) - Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids [56.892520712892804]
We introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three dexterous manipulation tasks. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors.
arXiv Detail & Related papers (2025-02-27T18:59:52Z) - RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation [52.14638923430338]
We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task.
Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language.
We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50%.
arXiv Detail & Related papers (2024-11-05T01:02:51Z) - Affordance-based Robot Manipulation with Flow Matching [7.51335919610328]
We present a framework for assistive robot manipulation. We tackle two challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, and second, effectively learning robot action trajectories by grounding the visual affordance model. We learn robot action trajectories guided by affordances with a supervised flow matching method.
arXiv Detail & Related papers (2024-09-02T09:11:28Z) - Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Learning Hierarchical Interactive Multi-Object Search for Mobile Manipulation [10.21450780640562]
We introduce a novel interactive multi-object search task in which a robot has to open doors to navigate rooms and search inside cabinets and drawers to find target objects.
These new challenges require combining manipulation and navigation skills in unexplored environments.
We present HIMOS, a hierarchical reinforcement learning approach that learns to compose exploration, navigation, and manipulation skills.
arXiv Detail & Related papers (2023-07-12T12:25:33Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Learning Sensorimotor Primitives of Sequential Manipulation Tasks from Visual Demonstrations [13.864448233719598]
This paper describes a new neural network-based framework for simultaneously learning low-level and high-level policies.
A key feature of the proposed approach is that the policies are learned directly from raw videos of task demonstrations.
Empirical results on object manipulation tasks with a robotic arm show that the proposed network can efficiently learn from real visual demonstrations to perform the tasks.
arXiv Detail & Related papers (2022-03-08T01:36:48Z)