ForeDiffusion: Foresight-Conditioned Diffusion Policy via Future View Construction for Robot Manipulation
- URL: http://arxiv.org/abs/2601.12925v1
- Date: Mon, 19 Jan 2026 10:28:42 GMT
- Title: ForeDiffusion: Foresight-Conditioned Diffusion Policy via Future View Construction for Robot Manipulation
- Authors: Weize Xie, Yi Ding, Ying He, Leilei Wang, Binwen Bai, Zheyi Zhao, Chenyang Wang, F. Richard Yu,
- Abstract summary: Diffusion strategies have advanced visual motor control by progressively denoising high-dimensional action sequences.<n>This paper proposes Foresight-Conditioned Diffusion (ForeDiffusion), by injecting the predicted future view representation into the diffusion process.<n>ForeDiffusion achieves an average success rate of 80% for the overall task, significantly outperforming the existing mainstream diffusion methods by 23% in complex tasks.
- Score: 37.67451041903772
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion strategies have advanced visual motor control by progressively denoising high-dimensional action sequences, providing a promising method for robot manipulation. However, as task complexity increases, the success rate of existing baseline models decreases considerably. Analysis indicates that current diffusion strategies are confronted with two limitations. First, these strategies only rely on short-term observations as conditions. Second, the training objective remains limited to a single denoising loss, which leads to error accumulation and causes grasping deviations. To address these limitations, this paper proposes Foresight-Conditioned Diffusion (ForeDiffusion), by injecting the predicted future view representation into the diffusion process. As a result, the policy is guided to be forward-looking, enabling it to correct trajectory deviations. Following this design, ForeDiffusion employs a dual loss mechanism, combining the traditional denoising loss and the consistency loss of future observations, to achieve the unified optimization. Extensive evaluation on the Adroit suite and the MetaWorld benchmark demonstrates that ForeDiffusion achieves an average success rate of 80% for the overall task, significantly outperforming the existing mainstream diffusion methods by 23% in complex tasks, while maintaining more stable performance across the entire tasks.
Related papers
- STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction [16.465783114087223]
iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems.<n>We propose STEP, a lightweighttemporal consistency prediction mechanism to construct high-quality warm-start actions.<n> STEP with 2 steps can achieve an average 21.6% and 27.5% higher success rate than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks.
arXiv Detail & Related papers (2026-02-09T03:50:40Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose textbfTACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks.<n>Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it incurs significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - Contamination Detection for VLMs using Multi-Modal Semantic Perturbation [73.76465227729818]
Open-source Vision-Language Models (VLMs) have achieved state-of-the-art performance on benchmark tasks.<n>Pretraining corpora raise a critical concern for both practitioners and users: inflated performance due to test-set leakage.<n>We show that existing detection approaches either fail outright or exhibit inconsistent behavior.<n>We propose a novel simple yet effective detection method based on multi-modal semantic perturbation.
arXiv Detail & Related papers (2025-11-05T18:59:52Z) - Two-Steps Diffusion Policy for Robotic Manipulation via Genetic Denoising [22.356276412952738]
Diffusion models have achieved state-of-the-art results in robotic manipulation by imitating expert demonstrations.<n>We show that by tailoring the denoising process to the specific characteristics of embodied AI tasks, diffusion policies can operate effectively.<n>We propose a population-based sampling strategy, genetic denoising, which enhances both performance and stability.
arXiv Detail & Related papers (2025-10-24T19:52:41Z) - SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization [62.958457694151384]
We introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models.<n>In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms.
arXiv Detail & Related papers (2025-10-22T16:11:22Z) - Boosting Fidelity for Pre-Trained-Diffusion-Based Low-Light Image Enhancement via Condition Refinement [63.54516423266521]
Pre-Trained Diffusion-Based (PTDB) methods often sacrifice content fidelity to attain higher perceptual realism.<n>We propose a novel optimization strategy for conditioning in pre-trained diffusion models, enhancing fidelity while preserving realism and aesthetics.<n>Our approach is plug-and-play, seamlessly integrating into existing diffusion networks to provide more effective control.
arXiv Detail & Related papers (2025-10-20T02:40:06Z) - Beyond Classification: Evaluating Diffusion Denoised Smoothing for Security-Utility Trade off [4.497768222083102]
Diffusion Denoised Smoothing is emerging as a promising technique to enhance model robustness.<n>We analyze three datasets with four distinct downstream tasks under three different adversarial attack algorithms.<n>High-noise diffusion denoising to clean images without any distortions significantly degrades performance by as high as 57%.<n>We introduce a novel attack strategy specifically targeting the diffusion process itself, capable of circumventing defenses in the low-noise regime.
arXiv Detail & Related papers (2025-05-21T14:49:24Z) - HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to endow robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments.<n>Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction.<n>We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z) - E2ED^2:Direct Mapping from Noise to Data for Enhanced Diffusion Models [15.270657838960114]
Diffusion models have established themselves as the de facto primary paradigm in visual generative modeling.<n>We present a novel end-to-end learning paradigm that establishes direct optimization from the final generated samples to initial noises.<n>Our method achieves substantial performance gains in terms of Fr'eche't Inception Distance (FID) and CLIP score, even with fewer sampling steps.
arXiv Detail & Related papers (2024-12-30T16:06:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.