CAPE: Context-Aware Diffusion Policy Via Proximal Mode Expansion for Collision Avoidance
- URL: http://arxiv.org/abs/2511.22773v1
- Date: Thu, 27 Nov 2025 21:53:09 GMT
- Title: CAPE: Context-Aware Diffusion Policy Via Proximal Mode Expansion for Collision Avoidance
- Authors: Rui Heng Yang, Xuan Zhao, Leo Maxime Brunswic, Montgomery Alban, Mateo Clemente, Tongtong Cao, Jun Jin, Amir Rasouli,
- Abstract summary: Context-Aware diffusion policy via Proximal mode Expansion (CAPE). CAPE expands trajectory distribution modes with a context-aware prior and guidance at inference. We evaluate CAPE on diverse manipulation tasks in cluttered, unseen simulated and real-world settings.
- Score: 15.311155448797386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In robotics, diffusion models can capture multi-modal trajectories from demonstrations, making them a transformative approach in imitation learning. However, achieving optimal performance under this regimen requires a large-scale dataset, which is costly to obtain, especially for challenging tasks such as collision avoidance. In those tasks, generalization at test time demands coverage of many obstacle types and their spatial configurations, which is impractical to acquire purely via data. To remedy this problem, we propose Context-Aware diffusion policy via Proximal mode Expansion (CAPE), a framework that expands trajectory distribution modes with a context-aware prior and guidance at inference via a novel prior-seeded iterative guided refinement procedure. The framework generates an initial trajectory plan and executes a short prefix trajectory; the remaining trajectory segment is then perturbed to an intermediate noise level, forming a trajectory prior. Such a prior is context-aware and preserves task intent. Repeating the process with context-aware guided denoising iteratively expands mode support, allowing smoother, less collision-prone trajectories to be found. For collision avoidance, CAPE expands trajectory distribution modes with collision-aware context, enabling the sampling of collision-free trajectories in previously unseen environments while maintaining goal consistency. We evaluate CAPE on diverse manipulation tasks in cluttered, unseen simulated and real-world settings and show up to 26% and 80% higher success rates, respectively, compared to SOTA methods, demonstrating better generalization to unseen environments.
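The prior-seeded iterative guided refinement loop described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the denoiser, the collision-aware guidance term, and all function names and parameters (`denoise`, `cape_rollout`, `noise_level`, `prefix`) are assumptions made for demonstration on a 2D point trajectory.

```python
import numpy as np

def denoise(traj, t, rng, guidance_scale=0.1, obstacles=None):
    """One guided denoising step (toy stand-in for a trained diffusion policy):
    pull waypoints toward a goal-consistent mode while a repulsive guidance
    term pushes them away from obstacles."""
    target = np.linspace(traj[0], traj[-1], len(traj))   # straight-line mode
    step = (target - traj) * (1.0 - t)                   # stronger at low noise
    if obstacles is not None:
        for obs in obstacles:
            d = traj - obs
            dist = np.linalg.norm(d, axis=-1, keepdims=True) + 1e-6
            step += guidance_scale * d / dist**2          # collision-aware guidance
    return traj + step + 0.01 * t * rng.standard_normal(traj.shape)

def cape_rollout(start, goal, obstacles, horizon=32, prefix=4,
                 noise_level=0.5, iters=6, seed=0):
    """Sketch of CAPE-style refinement: plan, execute a short prefix,
    re-noise the remainder to an intermediate level to form a context-aware
    prior, and re-denoise with guidance; repeat."""
    rng = np.random.default_rng(seed)
    executed = [np.asarray(start, dtype=float)]
    traj = np.linspace(start, goal, horizon)
    for _ in range(iters):
        for t in np.linspace(noise_level, 0.0, 5):        # guided denoising pass
            traj = denoise(traj, t, rng, obstacles=obstacles)
        executed.extend(traj[1:prefix + 1])               # execute short prefix
        remainder = traj[prefix:]
        if len(remainder) < 2:
            break
        # Perturb the remaining segment to an intermediate noise level:
        # this forms the trajectory prior that preserves task intent.
        prior = remainder + noise_level * rng.standard_normal(remainder.shape)
        prior[0] = executed[-1]                           # anchor at current state
        prior[-1] = goal                                  # keep goal consistency
        traj = prior
    executed.append(np.asarray(goal, dtype=float))
    return np.stack(executed)
```

The key design point mirrored here is that each iteration re-noises only the unexecuted suffix, so the prior stays close to the current plan (proximal) yet has enough noise for guided denoising to reach neighboring, less collision-prone modes.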
Related papers
- Scaling Dense Event-Stream Pretraining from Visual Foundation Models [112.44243079477137]
We launch a novel self-supervised pretraining method that distills visual foundation models (VFMs) to push the boundaries of event representation at scale. We curate an extensive synchronized image-event collection to amplify cross-modal alignment, and extend the alignment objective to semantic structures provided off-the-shelf by VFMs, indicating a broader receptive field and stronger supervision.
arXiv Detail & Related papers (2026-03-04T12:06:09Z) - LEAR: Learning Edge-Aware Representations for Event-to-LiDAR Localization [15.308350522323588]
We propose LEAR, a dual-task learning framework that jointly estimates edge structures and dense event-depth flow fields to bridge the sensing-modality divide. On several popular and challenging datasets, LEAR achieves superior performance over the best prior method.
arXiv Detail & Related papers (2026-03-02T13:18:25Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it offers significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - Flow Density Control: Generative Optimization Beyond Entropy-Regularized Fine-Tuning [59.11663802446183]
Flow and diffusion generative models can be adapted to optimize task-specific objectives while preserving prior information. We introduce Flow Density Control (FDC), a simple algorithm that reduces this complex problem to a specific sequence of simpler fine-tuning tasks, and derive convergence guarantees for the proposed scheme under realistic assumptions by leveraging recent understanding of mirror flows.
arXiv Detail & Related papers (2025-11-27T17:19:01Z) - Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance [63.33213516925946]
We introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
arXiv Detail & Related papers (2025-09-02T07:51:59Z) - TransDiffuser: Diverse Trajectory Generation with Decorrelated Multi-modal Representation for End-to-end Autonomous Driving [20.679370777762987]
We propose TransDiffuser, an encoder-decoder based generative trajectory planning model. We exploit a simple yet effective multi-modal representation decorrelation optimization mechanism during the denoising process. TransDiffuser achieves a PDMS of 94.85 on the closed-loop planning-oriented benchmark NAVSIM.
arXiv Detail & Related papers (2025-05-14T12:10:41Z) - Multi-Agent Path Finding in Continuous Spaces with Projected Diffusion Models [57.45019514036948]
Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics. This work proposes a novel approach that integrates constrained optimization with diffusion models for MAPF in continuous spaces.
arXiv Detail & Related papers (2024-12-23T21:27:19Z) - Efficient Data Representation for Motion Forecasting: A Scene-Specific Trajectory Set Approach [12.335528093380631]
This study introduces a novel approach for generating scene-specific trajectory sets tailored to different contexts. A deterministic goal sampling algorithm identifies relevant map regions, while our Recursive In-Distribution Subsampling (RIDS) method enhances trajectory plausibility. Experiments on the Argoverse 2 dataset demonstrate that our method achieves up to a 10% improvement in Driving Area Compliance.
arXiv Detail & Related papers (2024-07-30T11:06:39Z) - Efficient Text-driven Motion Generation via Latent Consistency Training [21.348658259929053]
We propose a motion latent consistency training framework (MLCT) to solve nonlinear reverse diffusion trajectories. By combining these enhancements, we achieve stable and consistent training in non-pixel modality and latent representation spaces.
arXiv Detail & Related papers (2024-05-05T02:11:57Z) - Controllable Diverse Sampling for Diffusion Based Motion Behavior Forecasting [11.106812447960186]
We introduce a novel trajectory generator named Controllable Diffusion Trajectory (CDT).
CDT integrates information and social interactions into a Transformer-based conditional denoising diffusion model to guide the prediction of future trajectories.
To ensure multimodality, we incorporate behavioral tokens to direct the trajectory's modes, such as going straight, turning right or left.
arXiv Detail & Related papers (2024-02-06T13:16:54Z) - Layout Sequence Prediction From Noisy Mobile Modality [53.49649231056857]
Trajectory prediction plays a vital role in understanding pedestrian movement for applications such as autonomous driving and robotics.
Current trajectory prediction models depend on long, complete, and accurately observed sequences from visual modalities.
We propose LTrajDiff, a novel approach that treats objects obstructed or out of sight as equally important as those with fully visible trajectories.
arXiv Detail & Related papers (2023-10-09T20:32:49Z) - DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.