AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation
- URL: http://arxiv.org/abs/2507.01961v3
- Date: Sat, 05 Jul 2025 04:00:38 GMT
- Title: AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation
- Authors: Sixiang Chen, Jiaming Liu, Siyuan Qian, Han Jiang, Lily Li, Renrui Zhang, Zhuoyang Liu, Chenyang Gu, Chengkai Hou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
- Abstract summary: Mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. Existing methods fail to explicitly model the influence of the mobile base on manipulator control. We propose the Adaptive Coordination Diffusion Transformer (AC-DiT) to enhance mobile base and manipulator coordination.
- Score: 31.314066269767057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating the mobile base and the manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they treat the entire mobile manipulation process with the same visual observation modality (e.g., either all 2D or all 3D), overlooking the distinct multimodal perception requirements at different stages of mobile manipulation. To address this, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as a contextual prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model, for example, to adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks.
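The perception-aware multimodal conditioning described above can be sketched as a learned gate that predicts per-modality fusion weights from a conditioning embedding, then blends projected 2D image features with 3D point-cloud features. This is a minimal illustrative sketch, not the paper's implementation: all function names, weight matrices, and dimensions are assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def perception_aware_fusion(feat_2d, feat_3d, cond, W2, W3, Wg):
    """Blend 2D and 3D features with weights predicted from a conditioning
    vector. Hypothetical sketch: W2/W3 project each modality to a shared
    dimension, Wg maps the conditioning embedding to two fusion logits."""
    w = softmax(cond @ Wg)            # (B, 2) fusion weights, sum to 1
    f2 = feat_2d @ W2                 # project 2D image features
    f3 = feat_3d @ W3                 # project 3D point-cloud features
    # Weighted blend: lean on 2D when semantics dominate, 3D for geometry.
    return w[:, :1] * f2 + w[:, 1:] * f3

# Toy usage with random features and illustrative dimensions.
rng = np.random.default_rng(0)
B = 4                                       # batch size
feat_2d = rng.standard_normal((B, 512))     # e.g. image encoder output
feat_3d = rng.standard_normal((B, 256))     # e.g. point-cloud encoder output
cond    = rng.standard_normal((B, 128))     # task/state conditioning embedding
W2 = rng.standard_normal((512, 384))
W3 = rng.standard_normal((256, 384))
Wg = rng.standard_normal((128, 2))

fused = perception_aware_fusion(feat_2d, feat_3d, cond, W2, W3, Wg)
print(fused.shape)  # (4, 384)
```

In a trained model the gate would learn to shift weight toward the 2D branch when semantic cues matter and toward the 3D branch when precise geometry matters; here the weights are random but always sum to one per sample.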
Related papers
- CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects [14.230098033626744]
Whole-body manipulation of articulated objects is a critical yet challenging task with broad applications in virtual humans and robotics. We propose a novel coordinated diffusion noise optimization framework to achieve realistic whole-body motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility.
arXiv Detail & Related papers (2025-05-27T17:11:50Z) - Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z) - Quantifying the Impact of Motion on 2D Gaze Estimation in Real-World Mobile Interactions [18.294511216241805]
This paper provides empirical evidence on how user mobility and behaviour affect mobile gaze tracking accuracy. Head distance, head pose, and device orientation are key factors affecting accuracy. The findings highlight the need for more robust, adaptive eye-tracking systems.
arXiv Detail & Related papers (2025-02-14T21:44:52Z) - Self-Supervised Learning of Grasping Arbitrary Objects On-the-Move [8.445514342786579]
This study introduces three fully convolutional neural network (FCN) models to predict static grasp primitive, dynamic grasp primitive, and residual moving velocity error from visual inputs.
The proposed method achieved the highest grasping accuracy and pick-and-place efficiency.
arXiv Detail & Related papers (2024-11-15T02:59:16Z) - Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z) - GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency [57.9920824261925]
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment.
Modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality.
GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
arXiv Detail & Related papers (2023-08-22T17:59:51Z) - 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z) - Semi-Weakly Supervised Object Kinematic Motion Prediction [56.282759127180306]
Given a 3D object, kinematic motion prediction aims to identify the mobile parts as well as the corresponding motion parameters.
We propose a graph neural network to learn the map between hierarchical part-level segmentation and mobile parts parameters.
The network predictions yield a large scale of 3D objects with pseudo labeled mobility information.
arXiv Detail & Related papers (2023-03-31T02:37:36Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z) - Consolidating Kinematic Models to Promote Coordinated Mobile Manipulations [96.03270112422514]
We construct a Virtual Kinematic Chain (VKC) that consolidates the kinematics of the mobile base, the arm, and the object to be manipulated in mobile manipulations.
A mobile manipulation task is represented by altering the state of the constructed VKC, which can be converted to a motion planning problem.
arXiv Detail & Related papers (2021-08-03T02:59:41Z) - Articulated Object Interaction in Unknown Scenes with Whole-Body Mobile Manipulation [16.79185733369416]
We propose a two-stage architecture for autonomous interaction with large articulated objects in unknown environments.
The first stage uses a learned model to estimate the articulated model of a target object from an RGB-D input and predicts an action-conditional sequence of states for interaction.
The second stage comprises a whole-body motion controller to manipulate the object along the generated kinematic plan.
arXiv Detail & Related papers (2021-03-18T21:32:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.