Improving Robotic Manipulation Robustness via NICE Scene Surgery
- URL: http://arxiv.org/abs/2511.22777v1
- Date: Thu, 27 Nov 2025 22:02:02 GMT
- Title: Improving Robotic Manipulation Robustness via NICE Scene Surgery
- Authors: Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, Amir Rasouli
- Abstract summary: We propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). NICE performs three editing operations: object replacement, restyling, and removal of distracting (non-target) objects. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets.
- Score: 22.613189865166557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning robust visuomotor policies for robotic manipulation remains a challenge in real-world settings, where visual distractors can significantly degrade performance and safety. In this work, we propose an effective and scalable framework, Naturalistic Inpainting for Context Enhancement (NICE). Our method minimizes the out-of-distribution (OOD) gap in imitation learning by increasing visual diversity, constructing new experiences from existing demonstrations. By utilizing image generative frameworks and large language models, NICE performs three editing operations: object replacement, restyling, and removal of distracting (non-target) objects. These changes preserve spatial relationships without obstructing target objects and maintain action-label consistency. Unlike previous approaches, NICE requires no additional robot data collection, simulator access, or custom model training, making it readily applicable to existing robotic datasets. Using real-world scenes, we demonstrate that our framework produces photo-realistic scene enhancements. For downstream tasks, we use NICE data to finetune a vision-language model (VLM) for spatial affordance prediction and a vision-language-action (VLA) policy for object manipulation. Our evaluations show that NICE successfully minimizes OOD gaps, resulting in over 20% improvement in accuracy for affordance prediction in highly cluttered scenes. For manipulation tasks, the success rate increases by 11% on average when testing in environments populated with varying numbers of distractors. Furthermore, we show that our method improves visual robustness, lowering target confusion by 6%, and enhances safety by reducing the collision rate by 7%.
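As a rough illustration of how such an augmentation step could be wired up, the sketch below edits one demonstration frame with an off-the-shelf inpainting model. The backend (Stable Diffusion inpainting via the `diffusers` library), the mask source, the edit prompt, and the helper `edit_frame` are all assumptions made for illustration; the paper does not specify its implementation.

```python
# Minimal sketch of a NICE-style scene edit on one demonstration frame.
# Assumptions (not from the paper): the generative backend is the Stable
# Diffusion inpainting pipeline from Hugging Face `diffusers`, object masks
# come from an off-the-shelf segmenter, and the edit prompt is produced by
# an external LLM (stubbed here as a plain string).

import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def edit_frame(frame: Image.Image,
               distractor_masks: list[np.ndarray],
               target_mask: np.ndarray,
               edit_prompt: str) -> Image.Image:
    """Replace/restyle/remove distractors while leaving the target untouched."""
    # Union of distractor regions, minus any overlap with the target object,
    # so the edit never obstructs the target and action labels stay valid.
    editable = np.zeros(target_mask.shape, dtype=bool)
    for m in distractor_masks:
        editable |= m.astype(bool)
    editable &= ~target_mask.astype(bool)

    mask_image = Image.fromarray((editable * 255).astype(np.uint8))

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting"
    )
    # The prompt (e.g. "a wooden bowl" for replacement, "" for removal)
    # would come from an LLM proposing plausible, scene-consistent edits.
    edited = pipe(prompt=edit_prompt,
                  image=frame.resize((512, 512)),
                  mask_image=mask_image.resize((512, 512))).images[0]
    return edited.resize(frame.size)

# Usage: the edited frame replaces the original observation in the
# demonstration; actions and other labels are copied over unchanged,
# which is what keeps the augmentation label-consistent for imitation learning.
```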
Related papers
- CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining [4.039082584778385]
We introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP). From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns.
arXiv Detail & Related papers (2026-01-31T23:32:54Z) - CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos [73.51386721543135]
We propose Contrastive Latent Action Pretraining (CLAP), a framework that aligns the visual latent space from videos with a proprioceptive latent space from robot trajectories. CLAP maps video transitions onto a quantized, physically executable codebook. We introduce a dual-formulation VLA framework offering both CLAP-NTP, an autoregressive model excelling at instruction following and object generalization, and CLAP-RF, a Rectified Flow-based policy designed for high-frequency, precise manipulation.
arXiv Detail & Related papers (2026-01-07T16:26:33Z) - VENTURA: Adapting Image Diffusion Models for Unified Task Conditioned Navigation [15.811034169990423]
VENTURA is a vision-supervised navigation system that finetunes internet-pretrained image diffusion models for path planning. A lightweight behavior-cloning policy grounds these visual plans into executable trajectories, yielding an interface that follows natural language instructions. In extensive real-world evaluations, VENTURA outperforms state-of-the-art foundation model baselines on object reaching, obstacle avoidance, and terrain preference tasks.
arXiv Detail & Related papers (2025-10-01T19:21:28Z) - DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge [41.030494146004806]
We propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling. DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provides compact yet comprehensive representations for action planning. Experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks.
arXiv Detail & Related papers (2025-07-06T16:14:29Z) - 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks [19.026406684039006]
Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models to learn the mapping between RGB images, language instructions, and joint space control. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model. Our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1%.
arXiv Detail & Related papers (2025-05-09T05:32:40Z) - Precise Mobile Manipulation of Small Everyday Objects [11.45585588241935]
We develop Servoing with Vision Models (SVM), a closed-loop framework that enables a mobile manipulator to tackle precise tasks involving the manipulation of small objects. SVM uses state-of-the-art vision foundation models to generate 3D targets for visual servoing, enabling diverse tasks in novel environments. We conduct a large-scale evaluation spanning experiments in 10 novel environments across 6 buildings, including 72 different object instances.
arXiv Detail & Related papers (2025-02-19T18:59:17Z) - Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle action-related and distracting dynamics. Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%.
arXiv Detail & Related papers (2025-02-13T11:27:05Z) - Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics [68.36528819227641]
This paper systematically evaluates the robustness of Vision-Language-Action (VLA) models. We introduce two untargeted attack objectives that leverage spatial foundations to destabilize robotic actions, and a targeted attack objective that manipulates the robotic trajectory. We design an adversarial patch generation approach that places a small, colorful patch within the camera's view, effectively executing the attack in both digital and physical environments.
arXiv Detail & Related papers (2024-11-18T01:52:20Z) - Uncertainty-aware Active Learning of NeRF-based Object Models for Robot Manipulators using Visual and Re-orientation Actions [8.059133373836913]
This paper presents an approach that enables a robot to rapidly learn the complete 3D model of a given object for manipulation in unfamiliar orientations.
We use an ensemble of partially constructed NeRF models to quantify model uncertainty to determine the next action.
Our approach determines when and how to grasp and re-orient an object given its partial NeRF model and re-estimates the object pose to rectify misalignments introduced during the interaction.
arXiv Detail & Related papers (2024-04-02T10:15:06Z) - Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize the active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - A Framework for Efficient Robotic Manipulation [79.10407063260473]
We show that, given only 10 demonstrations, a single robotic arm can learn sparse-reward manipulation policies from pixels.
arXiv Detail & Related papers (2020-12-14T22:18:39Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)