EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation
- URL: http://arxiv.org/abs/2511.13312v1
- Date: Mon, 17 Nov 2025 12:47:18 GMT
- Title: EL3DD: Extended Latent 3D Diffusion for Language Conditioned Multitask Manipulation
- Authors: Jonas Bode, Raphael Memmesheimer, Sven Behnke
- Abstract summary: This paper seeks to harness the capabilities of diffusion models within a visuomotor policy framework to generate precise robotic trajectories. By employing reference demonstrations during training, the model learns to execute manipulation tasks specified through textual commands within the robot's immediate environment.
- Score: 16.468655011980843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acting in human environments is a crucial capability for general-purpose robots, necessitating a robust understanding of natural language and its application to physical tasks. This paper seeks to harness the capabilities of diffusion models within a visuomotor policy framework that merges visual and textual inputs to generate precise robotic trajectories. By employing reference demonstrations during training, the model learns to execute manipulation tasks specified through textual commands within the robot's immediate environment. The proposed research aims to extend an existing model by leveraging improved embeddings and adapting techniques from diffusion models for image generation. We evaluate our methods on the CALVIN dataset, demonstrating enhanced performance on various manipulation tasks and an increased long-horizon success rate when multiple tasks are executed in sequence. Our approach reinforces the usefulness of diffusion models and contributes towards general multitask manipulation.
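The abstract describes a visuomotor diffusion policy that fuses visual and textual inputs and is trained from reference demonstrations. The snippet below is a minimal sketch of that general pattern, a conditional denoiser trained with a DDPM-style noise-regression loss; it is not the authors' EL3DD implementation, and all module names, dimensions, and the noise schedule are illustrative assumptions.

```python
# Minimal sketch of a language-conditioned diffusion policy (illustrative,
# not the EL3DD code): a denoiser predicts the noise added to a demonstration
# trajectory, conditioned on a fused vision+language embedding.
import torch
import torch.nn as nn

class TrajectoryDenoiser(nn.Module):
    def __init__(self, traj_dim=7, horizon=16, cond_dim=512, hidden=256):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, hidden)   # fused vision+language condition
        self.time_embed = nn.Embedding(1000, hidden)   # diffusion timestep embedding
        self.net = nn.Sequential(
            nn.Linear(traj_dim * horizon + hidden, hidden),
            nn.Mish(),
            nn.Linear(hidden, traj_dim * horizon),
        )
        self.horizon, self.traj_dim = horizon, traj_dim

    def forward(self, noisy_traj, t, cond):
        # noisy_traj: (B, horizon, traj_dim), t: (B,), cond: (B, cond_dim)
        c = self.cond_proj(cond) + self.time_embed(t)
        x = torch.cat([noisy_traj.flatten(1), c], dim=-1)
        return self.net(x).view(-1, self.horizon, self.traj_dim)

def ddpm_training_step(model, traj, cond, alphas_cumprod):
    """One DDPM-style training step: noise a demonstration trajectory at a
    random timestep and regress the injected noise, given the condition."""
    b = traj.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=traj.device)
    noise = torch.randn_like(traj)
    a = alphas_cumprod[t].view(b, 1, 1)
    noisy = a.sqrt() * traj + (1 - a).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, t, cond), noise)
```

At inference time, the same denoiser would be applied iteratively, starting from Gaussian noise, to sample a trajectory consistent with the textual command and the observed scene.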
Related papers
- Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents R-prior-S, a novel Recurrent Geometric-priormodal Policy with Spiking features. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases. For the data-efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z) - Affordance RAG: Hierarchical Multimodal Retrieval with Affordance-Aware Embodied Memory for Mobile Manipulation [20.373596661083152]
Affordance RAG is a zero-shot hierarchical multimodal retrieval framework that constructs Affordance-Aware Embodied Memory from pre-explored images. Our method outperformed existing approaches in retrieval performance for mobile manipulation instruction in large-scale indoor environments.
arXiv Detail & Related papers (2025-12-22T02:55:25Z) - Exploring Conditions for Diffusion models in Robotic Control [70.27711404291573]
We explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control. We find that naively applying textual conditions yields minimal or even negative gains in control tasks. We propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details.
arXiv Detail & Related papers (2025-10-17T10:24:14Z) - Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models [22.826115023573205]
We infuse the predictive nature of human manipulation strategies into robot imitation learning. We train a diffusion model to predict future states and compute robot actions that achieve the predicted states (a minimal sketch of this state-diffusion plus inverse-dynamics pattern follows this list). Our framework consistently outperforms state-of-the-art state-to-action mapping policies.
arXiv Detail & Related papers (2025-03-30T01:25:35Z) - SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
SAM-E leverages Segment Anything (SAM), a vision foundation model, for generalizable scene understanding and sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation. We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%. Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z) - An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - Masked Diffusion with Task-awareness for Procedure Planning in Instructional Videos [16.93979476655776]
A key challenge with procedure planning in instructional videos is how to handle a large decision space consisting of a multitude of action types.
We introduce a simple yet effective enhancement - a masked diffusion model.
We learn a joint visual-text embedding, where a text embedding is generated by prompting a pre-trained vision-language model to focus on human actions.
arXiv Detail & Related papers (2023-09-14T03:25:37Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
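Several of the works above, for example the coordinated bimanual manipulation paper, follow a state-diffusion plus inverse-dynamics pattern: a diffusion model samples a future state sequence, and a separate inverse-dynamics model converts consecutive predicted states into actions. The sketch below illustrates that generic pattern only; it is not code from any cited paper, and the sampler interface, names, and dimensions are assumptions.

```python
# Illustrative sketch of the state-diffusion + inverse-dynamics pattern
# (assumed interfaces, not code from the cited papers).
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Maps a pair of consecutive states to the action connecting them."""
    def __init__(self, state_dim=14, action_dim=14, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s_t, s_next):
        return self.net(torch.cat([s_t, s_next], dim=-1))

def actions_from_predicted_states(state_diffusion_sample, inv_dyn, obs):
    # state_diffusion_sample is an assumed callable returning a sampled
    # future state sequence of shape (B, horizon, state_dim) given obs.
    states = state_diffusion_sample(obs)
    actions = [inv_dyn(states[:, i], states[:, i + 1])
               for i in range(states.shape[1] - 1)]
    return torch.stack(actions, dim=1)   # (B, horizon - 1, action_dim)
```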