ImitDiff: Transferring Foundation-Model Priors for Distraction Robust Visuomotor Policy
- URL: http://arxiv.org/abs/2502.09649v2
- Date: Sat, 08 Nov 2025 07:31:30 GMT
- Title: ImitDiff: Transferring Foundation-Model Priors for Distraction Robust Visuomotor Policy
- Authors: Yuhang Dong, Haizhou Ge, Yupei Zeng, Jiangning Zhang, Beiwen Tian, Hongrui Zhu, Yufei Jia, Ruixiang Wang, Zhucun Xue, Guyue Zhou, Longhua Ma, Guanzhong Tian
- Abstract summary: ImitDiff is a diffusion-based imitation learning policy guided by fine-grained semantics. Our method transforms high-level instructions into pixel-level visual semantic masks. ImitDiff exhibits strong generalization in zero-shot settings involving novel objects and visual distractions.
- Score: 39.06557194970261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visuomotor imitation learning policies enable robots to efficiently acquire manipulation skills from visual demonstrations. However, as scene complexity and visual distractions increase, policies that perform well in simple settings often experience substantial performance degradation. To address this challenge, we propose ImitDiff, a diffusion-based imitation learning policy guided by fine-grained semantics within a dual-resolution workflow. Leveraging pretrained priors of vision-language foundation models, our method transforms high-level instructions into pixel-level visual semantic masks. These masks guide a dual-resolution perception pipeline that captures both global context (e.g., overall layout) from low-resolution observation and fine-grained local features (e.g., geometric details) from high-resolution observation, enabling the policy to focus on task-relevant regions. Additionally, we introduce a consistency-driven diffusion transformer action head that bridges visual semantic conditions and real-time action generation. Extensive experiments demonstrate that ImitDiff outperforms state-of-the-art vision-language manipulation frameworks, as well as visuomotor imitation learning policies, particularly under increased scene complexity and visual distractions. Notably, ImitDiff exhibits strong generalization in zero-shot settings involving novel objects and visual distractions. Furthermore, our consistency-driven action head achieves an order-of-magnitude improvement in inference speed while maintaining competitive success rates.
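The abstract's dual-resolution workflow, where semantic masks steer attention to task-relevant regions, can be illustrated with a rough sketch. This is not the authors' implementation; the function name, the naive strided downsampling, and the fixed crop size are all assumptions made for illustration:

```python
import numpy as np

def dual_resolution_views(image, mask, low_res=64, crop_size=96):
    """Illustrative dual-resolution split: a coarse global view for overall
    layout, plus a high-resolution crop around the semantic mask's region
    of interest for fine-grained local features.
    image: (H, W, C) array; mask: (H, W) binary array."""
    H, W, _ = image.shape

    # Global context: naive strided downsampling to low_res x low_res.
    sy, sx = H // low_res, W // low_res
    global_view = image[::sy, ::sx][:low_res, :low_res]

    # Local detail: fixed-size window centred on the mask's centroid,
    # clipped so the crop stays inside the image bounds.
    ys, xs = np.nonzero(mask)
    cy, cx = int(ys.mean()), int(xs.mean())
    half = crop_size // 2
    top = np.clip(cy - half, 0, H - crop_size)
    left = np.clip(cx - half, 0, W - crop_size)
    local_view = image[top:top + crop_size, left:left + crop_size]

    return global_view, local_view
```

In the actual policy the two views would feed separate encoder branches before conditioning the diffusion action head; here they are simply returned as arrays.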
Related papers
- Attentive Feature Aggregation or: How Policies Learn to Stop Worrying about Robustness and Attend to Task-Relevant Visual Cues [69.24378760740171]
We address visuomotor policy feature pooling as a solution to the observed lack of robustness in perturbed scenes. We achieve this via Attentive Feature Aggregation (AFA), a lightweight, trainable pooling mechanism that learns to naturally attend to task-relevant visual cues. Our findings show that ignoring extraneous visual information is a crucial step towards deploying robust and generalisable visuomotor policies.
arXiv Detail & Related papers (2025-11-13T19:31:05Z)
- Oracle-Guided Masked Contrastive Reinforcement Learning for Visuomotor Policies [9.663452274930643]
A prevailing approach for learning visuomotor policies is to employ reinforcement learning to map high-dimensional visual observations directly to action commands. We propose Oracle-Guided Masked Contrastive Reinforcement Learning (OMC-RL), a novel framework designed to improve the sample efficiency and performance of visuomotor policy learning.
arXiv Detail & Related papers (2025-10-07T08:49:31Z)
- Control and Realism: Best of Both Worlds in Layout-to-Image without Training [59.16447569868382]
We present WinWinLay, a training-free method for layout-to-image generation. We propose two key strategies, Non-local Attention Energy and Adaptive Update, that collaboratively enhance control precision and realism. WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming current state-of-the-art methods.
arXiv Detail & Related papers (2025-06-18T15:39:02Z)
- Zero-Shot Visual Generalization in Robot Manipulation [0.13280779791485384]
Current approaches often sidestep the problem by relying on invariant representations such as point clouds and depth. Disentangled representation learning has recently shown promise in enabling vision-based reinforcement learning policies to be robust to visual distribution shifts. We demonstrate zero-shot adaptability to visual perturbations in both simulation and on real hardware.
arXiv Detail & Related papers (2025-05-16T22:01:46Z)
- Mitigating Visual Knowledge Forgetting in MLLM Instruction-tuning via Modality-decoupled Gradient Descent [72.1517476116743]
Recent MLLMs have shown emerging visual understanding and reasoning abilities after being pre-trained on large-scale multimodal datasets. Existing approaches, such as direct fine-tuning and continual learning methods, fail to explicitly address this issue. We introduce a novel perspective leveraging effective rank to quantify the degradation of visual representations. We propose a modality-decoupled gradient descent (MDGD) method that regulates gradient updates to maintain the effective rank of visual representations.
arXiv Detail & Related papers (2025-02-17T12:26:34Z)
- Salience-Invariant Consistent Policy Learning for Generalization in Visual Reinforcement Learning [12.9372563969007]
Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning. In unseen environments, distracting pixels may lead agents to extract representations containing task-irrelevant information. We propose the Salience-Invariant Consistent Policy Learning algorithm, an efficient framework for zero-shot generalization.
arXiv Detail & Related papers (2025-02-12T12:00:16Z)
- Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning [17.013498508426398]
Compositional Zero-Shot Learning (CZSL) aims to enable models to recognize novel compositions of visual states and objects that were absent during training. We propose Duplex, a novel dual-prototype learning method that integrates semantic and visual prototypes through a carefully designed dual-branch architecture.
arXiv Detail & Related papers (2025-01-13T08:04:32Z)
- EAGLE: Enhanced Visual Grounding Minimizes Hallucinations in Instructional Multimodal Models [54.234657224615354]
Large language models and vision transformers have demonstrated impressive zero-shot capabilities, enabling significant transferability in downstream tasks. Despite incorporating vast image and language pre-training, these multi-modal architectures often generate responses that deviate from the ground truth in the image data. Current methods for mitigating hallucinations generally focus on regularizing the language component, improving the fusion module, or ensembling multiple visual encoders to improve visual representation. We show that a straightforward reformulation of the original contrastive pre-training task results in an improved visual encoder that can be incorporated into the instructional multi-modal architecture without additional instructional training.
arXiv Detail & Related papers (2025-01-06T00:39:31Z)
- Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control [73.6361029556484]
Embodied AI agents require a fine-grained understanding of the physical world mediated through visual and language inputs.
We consider pre-trained text-to-image diffusion models, which are explicitly optimized to generate images from text prompts.
We show that Stable Control Representations enable learning policies that exhibit state-of-the-art performance on OVMM, a difficult open-vocabulary navigation benchmark.
arXiv Detail & Related papers (2024-05-09T15:39:54Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
- SeMAIL: Eliminating Distractors in Visual Imitation via Separated Models [22.472167814814448]
We propose a new model-based imitation learning algorithm named Separated Model-based Adversarial Imitation Learning (SeMAIL).
Our method achieves near-expert performance on various visual control tasks with complex observations and the more challenging tasks with different backgrounds from expert observations.
arXiv Detail & Related papers (2023-06-19T04:33:44Z)
- Bilevel Fast Scene Adaptation for Low-Light Image Enhancement [50.639332885989255]
Enhancing images in low-light scenes is a challenging but widely studied task in computer vision.
The main obstacle lies in modeling the latent correspondence under the distribution discrepancy across different scenes.
We introduce the bilevel paradigm to model the above latent correspondence.
A bilevel learning framework is constructed to endow the scene-irrelevant generality of the encoder towards diverse scenes.
arXiv Detail & Related papers (2023-06-02T08:16:21Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- Robust Deep Reinforcement Learning via Multi-View Information Bottleneck [7.188571996124112]
We introduce an auxiliary objective based on the multi-view information bottleneck (MIB) principle.
This encourages learning representations that are both predictive of the future and less sensitive to task-irrelevant distractions.
We demonstrate that our approach can achieve SOTA performance on challenging visual control tasks, even when the background is replaced with natural videos.
arXiv Detail & Related papers (2021-02-26T02:24:36Z)
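The multi-view information bottleneck entry above learns representations that stay predictive across views while shedding task-irrelevant distractions. A common concrete instantiation of such a view-invariance objective is an InfoNCE-style contrastive loss; the NumPy sketch below is illustrative only and is not the MIB bound the paper actually optimizes:

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style contrastive loss between two views' embeddings, one
    way to encourage view-invariant, distraction-robust representations.
    z1, z2: (N, D) arrays; row i of z1 and row i of z2 form a positive pair."""
    # L2-normalise so dot products become cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature              # (N, N) similarity matrix

    # Softmax cross-entropy with the matching pair (the diagonal) as target.
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When the two views agree (embeddings of matched pairs are similar), the diagonal dominates each softmax row and the loss approaches zero; mismatched views push it toward log N.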
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.