Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
- URL: http://arxiv.org/abs/2503.09423v2
- Date: Wed, 02 Apr 2025 09:52:34 GMT
- Title: Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter
- Authors: Kechun Xu, Xunlong Xia, Kaixuan Wang, Yifei Yang, Yunxuan Mao, Bing Deng, Rong Xiong, Yue Wang
- Abstract summary: We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. We propose an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer.
- Score: 26.44450403993957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. Others combine foundation models in a zero-shot setting, suffering from cascading errors. In addition, they primarily leverage vision and language foundation models, focusing less on action priors. In this paper, we aim to develop an effective policy by integrating foundation priors from vision, language, and action. We propose A$^2$, an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer. The alignment formulation enables our policy to train with less data and preserve zero-shot generalization capabilities. We show that a shared policy for both pick and place actions enhances the performance for each task, and introduce a policy adaptation scheme to accommodate the multi-modal nature of actions. Extensive experiments in simulation and the real world show that our policy achieves higher task success rates with fewer steps for both pick and place tasks in clutter, effectively generalizing to unseen objects and language instructions. Videos and code are available at https://xukechun.github.io/papers/A2.
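To make the alignment formulation concrete, below is a minimal PyTorch sketch of the core idea as the abstract describes it: a single learned cross-attention layer that scores unconditioned action candidates against 3D vision-language features. All tensor shapes, dimensions, and module names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): one cross-attention layer that
# scores unconditioned action candidates against 3D vision-language features.
import torch
import torch.nn as nn

class ActionPriorAlignment(nn.Module):
    def __init__(self, act_dim=64, vl_dim=512, hidden=256):
        super().__init__()
        # The single attention layer: action candidates attend to VL features.
        self.q = nn.Linear(act_dim, hidden)   # queries from action-prior features
        self.k = nn.Linear(vl_dim, hidden)    # keys from 3D vision-language features
        self.v = nn.Linear(vl_dim, hidden)    # values from 3D vision-language features
        self.score = nn.Linear(hidden, 1)     # per-candidate task-conditioned score

    def forward(self, act_feats, vl_feats):
        # act_feats: (B, Na, act_dim) features of unconditioned action candidates
        # vl_feats:  (B, Nv, vl_dim) fused 3D vision-language features
        q, k, v = self.q(act_feats), self.k(vl_feats), self.v(vl_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        fused = attn @ v                       # (B, Na, hidden)
        return self.score(fused).squeeze(-1)   # (B, Na) alignment scores

# Usage: pick the candidate whose prior best matches the instruction.
model = ActionPriorAlignment()
scores = model(torch.randn(1, 32, 64), torch.randn(1, 128, 512))
best = scores.argmax(dim=-1)  # index of the highest-scoring action candidate
```

Because only this one layer is learned while the action and vision-language priors stay frozen, the formulation plausibly explains the paper's claim of training with less data while preserving zero-shot generalization.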
Related papers
- Policy Learning with a Natural Language Action Space: A Causal Approach [24.096991077437146]
This paper introduces a novel causal framework for multi-stage decision-making in natural language action spaces. Our approach employs Q-learning to estimate Dynamic Treatment Regimes (DTR) through a single model. A key technical contribution of our approach is a decoding strategy that translates optimized embeddings back into coherent natural language.
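A hedged sketch of what a single Q-model over a language action space could look like: candidate utterances are scored in embedding space, and "decoding" is reduced here to a nearest-candidate lookup. All names, shapes, and the candidate-pool construction are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch: Q-learning over language-action embeddings.
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim=128, act_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, state, act_emb):
        # state: (N, state_dim), act_emb: (N, act_dim) -> Q-values (N,)
        return self.net(torch.cat([state, act_emb], dim=-1)).squeeze(-1)

q = QNet()
opt = torch.optim.Adam(q.parameters(), lr=1e-3)

state = torch.randn(1, 128)
candidates = torch.randn(20, 64)   # embeddings of 20 candidate utterances
best = q(state.expand(20, -1), candidates).argmax()  # greedy action selection

# One TD(0) backup toward r + gamma * max_a' Q(s', a'):
reward, gamma, next_state = 1.0, 0.99, torch.randn(1, 128)
with torch.no_grad():
    target = reward + gamma * q(next_state.expand(20, -1), candidates).max()
chosen = candidates[best].unsqueeze(0)
loss = ((q(state, chosen) - target) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
# "Decoding" here: return the utterance whose embedding is `chosen`.
```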
arXiv Detail & Related papers (2025-02-24T17:26:07Z)
- ACT-JEPA: Joint-Embedding Predictive Architecture Improves Policy Representation Learning [90.41852663775086]
ACT-JEPA is a novel architecture that integrates imitation learning and self-supervised learning. We train a policy to predict action sequences and abstract observation sequences. Our experiments show that ACT-JEPA improves the quality of representations by learning temporal environment dynamics.
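A minimal sketch of how a JEPA-style objective can be combined with behaviour cloning as the summary describes: the policy predicts both expert actions and the next abstract observation latent, with a stop-gradient target. Encoders, dimensions, and losses are illustrative assumptions.

```python
# Sketch: JEPA-style latent prediction + behaviour cloning (assumed setup).
import torch
import torch.nn as nn

enc = nn.Linear(32, 64)      # observation encoder -> abstract latent
pred = nn.Linear(64, 64)     # predicts the *next* latent, not raw pixels
act_head = nn.Linear(64, 7)  # action head for imitation learning

obs_t, obs_t1 = torch.randn(8, 32), torch.randn(8, 32)
expert_action = torch.randn(8, 7)

z_t = enc(obs_t)
with torch.no_grad():        # stop-gradient target, JEPA-style
    z_t1_target = enc(obs_t1)

jepa_loss = nn.functional.mse_loss(pred(z_t), z_t1_target)      # abstract obs
bc_loss = nn.functional.mse_loss(act_head(z_t), expert_action)  # actions
loss = jepa_loss + bc_loss
loss.backward()
```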
arXiv Detail & Related papers (2025-01-24T16:41:41Z)
- Affordance-Centric Policy Learning: Sample Efficient and Generalisable Robot Policy Learning using Affordance-Centric Task Frames [15.800100875117312]
Affordances are central to robotic manipulation, where most tasks can be simplified to interactions with task-specific regions on objects.
We propose an affordance-centric policy-learning approach that centres and appropriately orients a task frame on these affordance regions.
We demonstrate that our approach can learn manipulation tasks using behaviour cloning from as little as 10 demonstrations, with equivalent generalisation to an image-based policy trained on 305 demonstrations.
arXiv Detail & Related papers (2024-10-15T23:57:35Z)
- Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies [25.760946763103483]
We propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks. Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation.
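The "rigid action estimation" step can be illustrated with the standard Kabsch/SVD solution for the rigid transform that maps the current object point cloud onto the imagined one. Point-wise correspondence between the two clouds is assumed here for simplicity; this is a generic sketch, not the paper's exact procedure.

```python
# Sketch: recover the SE(3) transform between current and imagined clouds.
import numpy as np

def rigid_transform(P, Q):
    """Least-squares R, t with Q ~= P @ R.T + t, for P, Q of shape (N, 3)."""
    cP, cQ = P.mean(0), Q.mean(0)
    H = (P - cP).T @ (Q - cQ)                # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cQ - R @ cP
    return R, t

current = np.random.rand(100, 3)
R_true, t_true = np.eye(3), np.array([0.1, 0.0, 0.05])
imagined = current @ R_true.T + t_true       # "imagined" desired state
R, t = rigid_transform(current, imagined)
# (R, t) is the relative motion the gripper must apply to the object.
```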
arXiv Detail & Related papers (2024-06-17T17:00:41Z)
- Scalable Language Model with Generalized Continual Learning [58.700439919096155]
The Joint Adaptive Re-Parameterization (JARe) is integrated with Dynamic Task-related Knowledge Retrieval (DTKR) to enable adaptive adjustment of language models based on specific downstream tasks.
Our method demonstrates state-of-the-art performance on diverse backbones and benchmarks, achieving effective continual learning in both full-set and few-shot scenarios with minimal forgetting.
arXiv Detail & Related papers (2024-04-11T04:22:15Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control [58.06223121654735]
We show a method that taps into joint image- and goal-conditioned policies with language using only a small amount of language data.
Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but to the desired change between the start and goal images.
We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
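A minimal sketch of such an alignment objective, assuming a contrastive (InfoNCE-style) loss between instruction embeddings and embeddings of the start-to-goal image change; the encoders and dimensions below are stand-ins, not the paper's architecture.

```python
# Sketch: align language to the (start, goal) change with a contrastive loss.
import torch
import torch.nn as nn

lang_enc = nn.Linear(300, 128)        # stand-in language encoder
change_enc = nn.Linear(2 * 512, 128)  # encodes (start, goal) jointly

lang = torch.randn(16, 300)           # batch of instruction embeddings
start, goal = torch.randn(16, 512), torch.randn(16, 512)

z_l = nn.functional.normalize(lang_enc(lang), dim=-1)
z_c = nn.functional.normalize(change_enc(torch.cat([start, goal], -1)), dim=-1)

logits = z_l @ z_c.T / 0.07           # cosine similarities / temperature
labels = torch.arange(16)             # i-th instruction matches i-th change
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```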
arXiv Detail & Related papers (2023-06-30T20:09:39Z)
- Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach [12.016988248578027]
We focus on the task of language-conditioned object placement, in which a robot should generate placements that satisfy all the spatial constraints in language instructions.
We propose an object-centric framework that leverages foundation models to ground the reference objects and spatial relations for placement, which is more sample efficient and generalizable.
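A pipeline-level sketch of the object-centric idea: ground the reference objects, parse the spatial relation, then compute a placement that satisfies it. Every function below is a hypothetical stand-in for a foundation-model call (open-vocabulary detection, LLM relation parsing), not a real API.

```python
# Sketch: object-centric placement from a language instruction (all stubs).
import numpy as np

def detect_objects(image, query):
    """Hypothetical stand-in for an open-vocabulary detector."""
    return {"bowl": np.array([0.40, 0.20]), "mug": np.array([0.55, 0.25])}

def parse_relation(instruction):
    """Hypothetical stand-in for an LLM parse of the spatial constraint."""
    return {"reference": "bowl", "relation": "left_of", "offset": 0.10}

OFFSETS = {"left_of": np.array([-1.0, 0.0]), "right_of": np.array([1.0, 0.0]),
           "behind": np.array([0.0, 1.0]), "in_front_of": np.array([0.0, -1.0])}

def placement_point(image, instruction):
    objs = detect_objects(image, instruction)  # ground candidate objects
    c = parse_relation(instruction)            # ground the spatial relation
    ref = objs[c["reference"]]                 # reference object centre (m)
    return ref + OFFSETS[c["relation"]] * c["offset"]

print(placement_point(None, "put the mug to the left of the bowl"))
# -> [0.3 0.2]: a point 10 cm left of the bowl, satisfying the constraint
```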
arXiv Detail & Related papers (2023-04-06T06:51:15Z)
- Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space [76.46113138484947]
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments.
To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach goals for a wide range of tasks on command.
We propose Planning to Practice, a method that makes it practical to train goal-conditioned policies for long-horizon tasks.
arXiv Detail & Related papers (2022-05-17T06:58:17Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)