RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
- URL: http://arxiv.org/abs/2602.22013v2
- Date: Thu, 05 Mar 2026 07:12:37 GMT
- Title: RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
- Authors: I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen,
- Abstract summary: Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. Existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders. We introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization.
- Score: 12.753436440584409
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. However, existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders, leading to errors in both retrieval and generation stages. To address this limitation, we introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization. RobustVisRAG uses a non-causal path to capture degradation signals through unidirectional attention and a causal path to learn purified semantics guided by these signals. Together with the proposed Non-Causal Distortion Modeling and Causal Semantic Alignment objectives, the framework enforces a clear separation between semantics and degradations, enabling stable retrieval and generation under challenging visual conditions. To evaluate robustness under realistic conditions, we introduce the Distortion-VisRAG dataset, a large-scale benchmark containing both synthetic and real-world degraded documents across seven domains, with 12 synthetic and 5 real distortion types that comprehensively reflect practical visual degradations. Experimental results show that RobustVisRAG improves retrieval, generation, and end-to-end performance by 7.35%, 6.35%, and 12.40%, respectively, on real-world degradations, while maintaining comparable accuracy on clean inputs.
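The dual-path idea in the abstract can be pictured with a minimal sketch. Everything below is a hypothetical reading, not the authors' code: ViT-style patch tokens are assumed, a single learned query stands in for the non-causal (unidirectional-attention) path that summarizes degradation cues, and the semantic path attends over tokens after a projection of that summary is subtracted out.

```python
# Hypothetical sketch of the dual-path separation described in the abstract.
# Module names, shapes, and the subtraction-based conditioning are assumptions.
import torch
import torch.nn as nn

class DualPathEncoder(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        # Non-causal path: a learned query pools degradation cues from all
        # patches (a stand-in for the paper's unidirectional attention).
        self.degrade_token = nn.Parameter(torch.randn(1, 1, dim))
        self.degrade_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Causal path: self-attention over tokens purified by the degradation
        # summary, so semantics are learned separately from distortions.
        self.semantic_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.condition = nn.Linear(dim, dim)

    def forward(self, patch_tokens):                       # (B, N, dim)
        q = self.degrade_token.expand(patch_tokens.size(0), -1, -1)
        degrade, _ = self.degrade_attn(q, patch_tokens, patch_tokens)
        purified = patch_tokens - self.condition(degrade)  # project out degradation
        semantics, _ = self.semantic_attn(purified, purified, purified)
        return semantics, degrade

enc = DualPathEncoder()
sem, deg = enc(torch.randn(2, 196, 768))
print(sem.shape, deg.shape)  # (2, 196, 768) and (2, 1, 768)
```

The paper's Non-Causal Distortion Modeling and Causal Semantic Alignment objectives would then supervise `degrade` and `semantics` respectively; their exact losses are not specified in the abstract.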
Related papers
- ERGO: Excess-Risk-Guided Optimization for High-Fidelity Monocular 3D Gaussian Splatting [63.138778159026934]
We propose an adaptive optimization framework guided by excess risk decomposition, termed ERGO. ERGO dynamically estimates the view-specific excess risk and adaptively adjusts loss weights during optimization. Experiments on the Google Scanned Objects dataset and the OmniObject3D dataset demonstrate the superiority of ERGO over existing state-of-the-art methods.
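The summary's "adaptively adjusts loss weights" step admits a generic sketch. The inverse-running-mean rule below is an assumption for illustration, not ERGO's excess-risk estimator.

```python
# Hypothetical per-view loss reweighting: views whose current loss exceeds
# their running mean (a crude "excess risk" proxy) are upweighted.
import torch

def reweight_views(per_view_losses, running_means, momentum=0.9, eps=1e-8):
    excess = torch.clamp(per_view_losses.detach() - running_means, min=0.0)
    weights = 1.0 + excess / (running_means + eps)     # no grad through weights
    new_means = momentum * running_means + (1.0 - momentum) * per_view_losses.detach()
    return (weights * per_view_losses).sum(), new_means

losses = torch.rand(4, requires_grad=True)   # one rendering loss per view
total, means = reweight_views(losses, torch.full((4,), 0.5))
total.backward()
```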
arXiv Detail & Related papers (2026-02-10T20:44:43Z) - Improving Robustness of Vision-Language-Action Models by Restoring Corrupted Visual Inputs [6.2827295422415235]
Vision-Language-Action (VLA) models have emerged as a dominant paradigm for generalist robotic manipulation. However, reliable real-world deployment is severely hindered by their fragility to visual disturbances. We introduce the Corruption Restoration Transformer (CRT), a vision transformer designed to immunize VLA models against sensor disturbances.
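The restoration-front-end pattern this summary describes can be sketched as a wrapper; both submodules and the frozen-policy choice are placeholders, not CRT's architecture.

```python
# Hypothetical plug-in: restore the corrupted observation before it reaches
# a frozen VLA policy, so the policy never sees raw sensor disturbances.
import torch.nn as nn

class RestoreThenAct(nn.Module):
    def __init__(self, restorer: nn.Module, policy: nn.Module):
        super().__init__()
        self.restorer = restorer                # trainable restoration model
        self.policy = policy
        for p in self.policy.parameters():      # keep the VLA model frozen
            p.requires_grad_(False)

    def forward(self, image, instruction):
        clean = self.restorer(image)            # undo the corruption first
        return self.policy(clean, instruction)
```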
arXiv Detail & Related papers (2026-02-01T11:09:08Z) - ReViP: Reducing False Completion in Vision-Language-Action Models with Vision-Proprioception Rebalance [50.05984919728878]
We present ReViP, a novel VLA framework with Vision-Proprioception Rebalance to enhance visual grounding and robustness under perturbations. Specifically, we use an external VLM as a task-stage observer to extract real-time task-centric visual cues from visual observations. To evaluate false completion, we propose the first False-Completion Benchmark Suite built on LIBERO with controlled settings such as Object-Drop.
arXiv Detail & Related papers (2026-01-23T11:31:07Z) - Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding [54.05243949024302]
Existing robust MLLMs rely on implicit training/adaptation that focuses solely on visual encoder generalization. We propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity.
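Component (iii), dynamic reasoning depth, admits a one-line sketch; the linear mapping and the [0, 1] severity scale below are assumptions, not the paper's schedule.

```python
# Hypothetical depth schedule: longer reasoning chains for heavier degradation.
def reasoning_budget(severity: float, min_steps: int = 1, max_steps: int = 8) -> int:
    severity = min(max(severity, 0.0), 1.0)    # assumed severity in [0, 1]
    return min_steps + round(severity * (max_steps - min_steps))

print(reasoning_budget(0.0), reasoning_budget(1.0))  # 1 8
```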
arXiv Detail & Related papers (2025-12-19T12:56:17Z) - RobustGait: Robustness Analysis for Appearance Based Gait Recognition [20.556295879194277]
We present RobustGait, a framework for fine-grained robustness evaluation of appearance-based gait recognition systems. The benchmark introduces 15 corruption types at 5 severity levels across CASIA-B, CCPG, and SUSTech1K, with in-the-wild validation on MEVID, and evaluates six state-of-the-art gait systems. Applying noise at the RGB level better reflects real-world degradation and reveals how distortions propagate through silhouette extraction to the downstream gait recognition systems.
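The RGB-level corruption pipeline could look like the sketch below; the noise-level schedule is an assumed scale, not the benchmark's parameters.

```python
# Hypothetical corruption applied to RGB frames *before* silhouette
# extraction, so distortions propagate downstream as the benchmark describes.
import numpy as np

def gaussian_noise_rgb(frame: np.ndarray, severity: int) -> np.ndarray:
    sigma = [4, 8, 12, 18, 26][severity - 1]   # assumed scale for levels 1-5
    noisy = frame.astype(np.float32) + np.random.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```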
arXiv Detail & Related papers (2025-11-17T07:12:06Z) - Robust Diagram Reasoning: A Framework for Enhancing LVLM Performance on Visually Perturbed Scientific Diagrams [0.81996963503528]
Large Language Models (LLMs) and their multimodal variants (LVLMs) hold immense promise for scientific and engineering applications, yet their reasoning can falter on visually perturbed scientific diagrams. Existing evaluation benchmarks largely overlook this challenge, leaving the robust reasoning capabilities of LVLMs underexplored. We introduce the Robust Diagram Reasoning (RDR) framework, a novel approach designed to enhance and rigorously evaluate LVLMs' performance under such conditions.
arXiv Detail & Related papers (2025-08-23T09:50:58Z) - Understanding and Improving Training-Free AI-Generated Image Detections with Vision Foundation Models [68.90917438865078]
Deepfake techniques for facial synthesis and editing pose serious risks for generative models. In this paper, we investigate how detection performance varies across model backbones, types, and datasets. We introduce Contrastive Blur, which enhances performance on facial images, and MINDER, which addresses noise type bias, balancing performance across domains.
arXiv Detail & Related papers (2024-11-28T13:04:45Z) - Benchmark Granularity and Model Robustness for Image-Text Retrieval [44.045767657945895]
We show how dataset granularity and query perturbations affect retrieval performance and robustness. We show that richer captions consistently enhance retrieval, especially in text-to-image tasks. Our results highlight variation in model robustness and a dataset-dependent relationship between caption granularity and sensitivity to perturbations.
arXiv Detail & Related papers (2024-07-21T18:08:44Z) - Towards Evaluating the Robustness of Visual State Space Models [63.14954591606638]
Vision State Space Models (VSSMs) have demonstrated remarkable performance in visual perception tasks.
However, their robustness under natural and adversarial perturbations remains a critical concern.
We present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios.
arXiv Detail & Related papers (2024-06-13T17:59:44Z) - DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration [66.01846902242355]
Blind face restoration usually synthesizes degraded low-quality data with a pre-defined degradation model for training.
It is expensive and infeasible to include every type of degradation to cover real-world cases in the training data.
We propose Robust Degradation Remover (DR2) to first transform the degraded image to a coarse but degradation-invariant prediction, then employ an enhancement module to restore the coarse prediction to a high-quality image.
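The two-stage pattern reads naturally as a pipeline; both stages below are placeholders, not DR2's diffusion model or enhancement module.

```python
# Hypothetical DR2-style pipeline: a coarse, degradation-invariant pass
# followed by an enhancement stage that restores fine detail.
import torch.nn as nn

class TwoStageRestorer(nn.Module):
    def __init__(self, remover: nn.Module, enhancer: nn.Module):
        super().__init__()
        self.remover = remover      # e.g. diffusion-based remover (placeholder)
        self.enhancer = enhancer    # detail-recovery module (placeholder)

    def forward(self, degraded):
        coarse = self.remover(degraded)   # coarse but degradation-invariant
        return self.enhancer(coarse)      # high-quality restoration
```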
arXiv Detail & Related papers (2023-03-13T06:05:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.