DNAct: Diffusion Guided Multi-Task 3D Policy Learning
- URL: http://arxiv.org/abs/2403.04115v2
- Date: Fri, 8 Mar 2024 09:56:47 GMT
- Title: DNAct: Diffusion Guided Multi-Task 3D Policy Learning
- Authors: Ge Yan, Yueh-Hua Wu, Xiaolong Wang
- Abstract summary: DNAct is a language-conditioned multi-task policy framework.
It integrates neural rendering pre-training and diffusion training to enforce multi-modality learning in action sequence spaces.
- Score: 17.566655138104785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents DNAct, a language-conditioned multi-task policy framework
that integrates neural rendering pre-training and diffusion training to enforce
multi-modality learning in action sequence spaces. To learn a generalizable
multi-task policy with few demonstrations, the pre-training phase of DNAct
leverages neural rendering to distill 2D semantic features from foundation
models such as Stable Diffusion into a 3D space, providing a comprehensive
semantic understanding of the scene. Consequently, the learned representation
can be applied to challenging robotic tasks that require rich 3D semantics and
accurate geometry. Furthermore, we introduce a novel approach that uses
diffusion training to learn a joint vision-language feature that encapsulates the
inherent multi-modality in the multi-task demonstrations. By reconstructing the
action sequences from different tasks via the diffusion process, the model is
capable of distinguishing different modalities and thus improving the
robustness and the generalizability of the learned representation. DNAct
significantly surpasses SOTA NeRF-based multi-task manipulation approaches with
over 30% improvement in success rate. Project website: dnact.github.io.
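As a rough illustration of the diffusion-training idea described in the abstract, the sketch below (not the authors' implementation; all module names, dimensions, and the noise schedule are assumptions for illustration) trains a noise-prediction network to reconstruct demonstration action sequences conditioned on a fused vision-language feature, so that the conditioning feature is pushed to capture the multi-modality present in multi-task demonstrations.

```python
# Minimal sketch of diffusion training over action sequences, conditioned on a
# scene feature. Hypothetical shapes and modules; not the DNAct codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

T_STEPS = 100                                   # assumed number of diffusion timesteps
betas = torch.linspace(1e-4, 0.02, T_STEPS)     # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Predicts the noise added to an action sequence, given a conditioning feature."""
    def __init__(self, action_dim=8, horizon=16, cond_dim=256):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + cond_dim + 1, 512),
            nn.ReLU(),
            nn.Linear(512, action_dim * horizon),
        )

    def forward(self, noisy_actions, t, cond):
        # Concatenate flattened noisy actions, conditioning feature, and normalized timestep.
        x = torch.cat(
            [noisy_actions.flatten(1), cond, t.float().unsqueeze(1) / T_STEPS], dim=1
        )
        return self.net(x).view(-1, self.horizon, self.action_dim)

def diffusion_loss(model, actions, cond):
    """Standard DDPM-style noise-prediction loss over expert action sequences."""
    b = actions.shape[0]
    t = torch.randint(0, T_STEPS, (b,))
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    return F.mse_loss(model(noisy, t, cond), noise)

# Toy usage: in a DNAct-style pipeline, `cond` would come from the 3D encoder
# pre-trained with neural rendering; here it is a random placeholder.
model = NoisePredictor()
actions = torch.randn(4, 16, 8)   # batch of demonstration action sequences (toy data)
cond = torch.randn(4, 256)        # fused vision-language features (toy data)
loss = diffusion_loss(model, actions, cond)
loss.backward()
```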
Related papers
- Multi-modal Relation Distillation for Unified 3D Representation Learning [30.942281325891226]
Multi-modal Relation Distillation (MRD) is a tri-modal pre-training framework designed to distill reputable large Vision-Language Models (VLM) into 3D backbones.
MRD aims to capture both intra-relations within each modality and cross-relations between different modalities, producing more discriminative 3D shape representations.
arXiv Detail & Related papers (2024-07-19T03:43:48Z) - RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z) - Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition [6.995226697189459]
We employ a multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data.
Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks.
We release our pre-trained models as well as source code publicly.
arXiv Detail & Related papers (2024-04-16T20:51:36Z) - Generative Multimodal Models are In-Context Learners [60.50927925426832]
We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences.
Emu2 exhibits strong multimodal in-context learning abilities, even emerging to solve tasks that require on-the-fly reasoning.
arXiv Detail & Related papers (2023-12-20T18:59:58Z) - Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning [49.92517970237088]
We tackle the problem of training a robot to understand multimodal prompts.
This type of task poses a major challenge to robots' capability to understand the interconnection and complementarity between vision and language signals.
We introduce an effective framework that learns a policy to perform robot manipulation with multimodal prompts.
arXiv Detail & Related papers (2023-10-14T22:24:58Z) - Few-shot Multimodal Multitask Multilingual Learning [0.0]
We propose few-shot learning for a multimodal multitask multilingual (FM3) setting by adapting pre-trained vision and language models.
FM3 learns the most prominent tasks in the vision and language domains along with their intersections.
arXiv Detail & Related papers (2023-02-19T03:48:46Z) - Multi-Task Learning for Visual Scene Understanding [7.191593674138455]
This thesis is concerned with multi-task learning in the context of computer vision.
We propose several methods that tackle important aspects of multi-task learning.
The results show several advances in the state-of-the-art of multi-task learning.
arXiv Detail & Related papers (2022-03-28T16:57:58Z) - Channel Exchanging Networks for Multimodal and Multitask Dense Image Prediction [125.18248926508045]
We propose the Channel-Exchanging-Network (CEN), which is self-adaptive, parameter-free, and, more importantly, applicable to both multimodal fusion and multitask learning.
CEN dynamically exchanges channels between sub-networks of different modalities.
For dense image prediction, the validity of CEN is tested in four different scenarios.
arXiv Detail & Related papers (2021-12-04T05:47:54Z) - Multi-task learning from fixed-wing UAV images for 2D/3D city modeling [0.0]
Multi-task learning is an approach to scene understanding that involves multiple related tasks, each with potentially limited training data.
In urban management applications such as infrastructure development, traffic monitoring, smart 3D cities, and change detection, automated multi-task data analysis is required.
In this study, a common framework for the performance assessment of multi-task learning methods from fixed-wing UAV images for 2D/3D city modeling is presented.
arXiv Detail & Related papers (2021-08-25T14:45:42Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z) - M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training [119.16007395162431]
M3P is a Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training.
We show that M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.
arXiv Detail & Related papers (2020-06-04T03:54:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.