Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs
- URL: http://arxiv.org/abs/2602.01158v1
- Date: Sun, 01 Feb 2026 11:09:08 GMT
- Title: Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs
- Authors: Daniel Yezid Guarnizo Orjuela, Leonardo Scappatura, Veronica Di Gennaro, Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci
- Abstract summary: Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation. However, reliable real-world deployment is severely hindered by their fragility to visual disturbances. We introduce the Corruption Restoration Transformer (CRT), a vision transformer designed to immunize VLA models against sensor disturbances.
- Score: 6.2827295422415235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation, unifying perception and control within a single end-to-end architecture. However, despite their success in controlled environments, reliable real-world deployment is severely hindered by their fragility to visual disturbances. While existing literature extensively addresses physical occlusions caused by scene geometry, a critical failure mode remains largely unexplored: image corruptions. These sensor-level artifacts, ranging from electronic noise and dead pixels to lens contaminants, directly compromise the integrity of the visual signal prior to interpretation. In this work, we quantify this vulnerability, demonstrating that state-of-the-art VLAs such as $π_{0.5}$ and SmolVLA suffer catastrophic performance degradation, dropping from 90% success rates to as low as 2%, under common signal artifacts. To mitigate this, we introduce the Corruption Restoration Transformer (CRT), a plug-and-play and model-agnostic vision transformer designed to immunize VLA models against sensor disturbances. Leveraging an adversarial training objective, CRT restores clean observations from corrupted inputs without requiring computationally expensive fine-tuning of the underlying model. Extensive experiments across the LIBERO and Meta-World benchmarks demonstrate that CRT effectively recovers lost performance, enabling VLAs to maintain near-baseline success rates, even under severe visual corruption.
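The plug-and-play arrangement the abstract describes, a restoration front-end inserted before a frozen policy, can be sketched as follows. This is a minimal illustration under stated assumptions: `restore` is a hypothetical stand-in for CRT (here a trivial outlier clip, not a learned transformer), and `frozen_vla_policy` is a placeholder for an untouched VLA such as $π_{0.5}$ or SmolVLA, not the authors' implementation.

```python
import numpy as np

def restore(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the CRT restorer: clips impulse
    ("dead/hot pixel") outliers toward the image mean. The real
    model is a learned vision transformer trained adversarially."""
    mean = image.mean()
    return np.where(np.abs(image - mean) > 0.3, mean, image)

def frozen_vla_policy(image: np.ndarray) -> np.ndarray:
    """Placeholder frozen VLA mapping an observation to an action
    vector; it is never fine-tuned in the plug-and-play setup."""
    return np.array([image.mean(), image.std()])

def robust_policy(corrupted: np.ndarray) -> np.ndarray:
    # CRT sits purely at the input side: restore the observation
    # first, then hand the cleaned image to the unmodified policy.
    return frozen_vla_policy(restore(corrupted))

rng = np.random.default_rng(0)
clean = rng.uniform(0.4, 0.6, size=(8, 8))
corrupted = clean.copy()
corrupted[0, 0] = 1.0  # simulated hot pixel (sensor-level artifact)
action = robust_policy(corrupted)
```

The key design point the abstract emphasizes is that only the front-end is trained; the downstream policy's weights are left untouched, which is what makes the approach model-agnostic.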
Related papers
- RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations [12.753436440584409]
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. Existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders. We introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization.
arXiv Detail & Related papers (2026-02-25T15:27:57Z) - Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z) - ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z) - Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding [54.05243949024302]
Existing robust MLLMs rely on implicit training/adaptation that focuses solely on visual encoder generalization. We propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity.
arXiv Detail & Related papers (2025-12-19T12:56:17Z) - AR2: Attention-Guided Repair for the Robustness of CNNs Against Common Corruptions [5.294455344248843]
Deep neural networks suffer from significant performance degradation when exposed to common corruptions. We propose AR2 (Attention-Guided Repair for Robustness) to enhance the corruption robustness of pretrained CNNs.
arXiv Detail & Related papers (2025-07-08T18:37:00Z) - Analysing the Robustness of Vision-Language-Models to Common Corruptions [2.9459935333120972]
Vision-language models (VLMs) have demonstrated impressive capabilities in understanding and reasoning about visual and textual content. We present the first comprehensive analysis of VLM robustness across 19 corruption types from the ImageNet-C benchmark. We introduce two new benchmarks, TextVQA-C and GQA-C, to evaluate how corruptions affect scene text understanding and object-based reasoning.
arXiv Detail & Related papers (2025-04-18T13:46:32Z) - SegSTRONG-C: Segmenting Surgical Tools Robustly On Non-adversarial Generated Corruptions -- An EndoVis'24 Challenge [35.28816926000958]
We introduce SegSTRONG-C, a dedicated benchmark and challenge in surgical data science. We aim to better understand model deterioration under unforeseen but plausible non-adversarial corruption. The performance of the challenge winners, achieving an average 0.9394 DSC and 0.9301 NSD, shows encouraging improvement.
arXiv Detail & Related papers (2024-07-16T16:50:43Z) - Boosting Visual Recognition in Real-world Degradations via Unsupervised Feature Enhancement Module with Deep Channel Prior [22.323789227447755]
Fog, low-light, and motion blur degrade image quality and pose threats to the safety of autonomous driving.
This work proposes a novel Deep Channel Prior (DCP) for degraded visual recognition.
Based on this, a novel plug-and-play Unsupervised Feature Enhancement Module (UFEM) is proposed to achieve unsupervised feature correction.
arXiv Detail & Related papers (2024-04-02T07:16:56Z) - DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration [66.01846902242355]
Blind face restoration usually synthesizes degraded low-quality data with a pre-defined degradation model for training.
It is expensive and infeasible to include every type of degradation to cover real-world cases in the training data.
We propose Robust Degradation Remover (DR2) to first transform the degraded image to a coarse but degradation-invariant prediction, then employ an enhancement module to restore the coarse prediction to a high-quality image.
arXiv Detail & Related papers (2023-03-13T06:05:18Z) - Distortion-Disentangled Contrastive Learning [13.27998440853596]
We propose a novel POCL framework named Distortion-Disentangled Contrastive Learning (DDCL) and a Distortion-Disentangled Loss (DDL).
Our approach is the first to explicitly disentangle and exploit the DVR inside the model and feature stream to improve the overall representation utilization efficiency, robustness and representation ability.
arXiv Detail & Related papers (2023-03-09T06:33:31Z) - On the Robustness of Quality Measures for GANs [136.18799984346248]
This work evaluates the robustness of quality measures of generative models such as Inception Score (IS) and Fréchet Inception Distance (FID).
We show that such metrics can also be manipulated by additive pixel perturbations.
arXiv Detail & Related papers (2022-01-31T06:43:09Z) - Improving robustness against common corruptions with frequency biased models [112.65717928060195]
Unseen image corruptions can cause a surprisingly large drop in performance.
Image corruption types have different characteristics in the frequency spectrum and would benefit from a targeted type of data augmentation.
We propose a new regularization scheme that minimizes the total variation (TV) of convolution feature-maps to increase high-frequency robustness.
arXiv Detail & Related papers (2021-03-30T10:44:50Z)
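The last entry's regularizer penalizes the total variation (TV) of convolutional feature maps to suppress high-frequency responses. A minimal sketch of such a penalty, assuming the common anisotropic TV (sum of absolute differences between adjacent activations) applied to a single 2-D map rather than inside a full training loop:

```python
import numpy as np

def feature_map_tv(fmap: np.ndarray) -> float:
    """Anisotropic total variation of a 2-D feature map:
    sum of absolute differences between vertically and
    horizontally adjacent activations. Adding this (weighted)
    to a task loss discourages high-frequency feature responses."""
    dh = np.abs(np.diff(fmap, axis=0)).sum()  # vertical neighbours
    dw = np.abs(np.diff(fmap, axis=1)).sum()  # horizontal neighbours
    return float(dh + dw)

smooth = np.ones((4, 4))                  # constant map: TV = 0
noisy = np.indices((4, 4)).sum(0) % 2.0   # checkerboard: maximal TV
```

In the paper's setting the penalty would be computed on intermediate CNN feature maps during training; the arrays here merely stand in for one such map, contrasting a smooth response with a maximally oscillating one.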
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and accepts no responsibility for any consequences of its use.