Related papers: Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

URL: http://arxiv.org/abs/2512.17532v1
Date: Fri, 19 Dec 2025 12:56:17 GMT
Title: Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding
Authors: Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, Qifeng Chen
Abstract summary: Existing robust MLLMs rely on implicit training/adaptation that focuses solely on visual encoder generalization. We propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity.
Score: 54.05243949024302
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.

Related papers

RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations [12.753436440584409]
Vision-based Retrieval-Augmented Generation (VisRAG) leverages vision-language models (VLMs) to jointly retrieve relevant visual documents and generate grounded answers based on multimodal evidence. Existing VisRAG models degrade in performance when visual inputs suffer from distortions such as blur, noise, low light, or shadow, where semantic and degradation factors become entangled within pretrained visual encoders. We introduce RobustVisRAG, a causality-guided dual-path framework that improves VisRAG robustness while preserving efficiency and zero-shot generalization.
arXiv Detail & Related papers (2026-02-25T15:27:57Z)
Understanding Degradation with Vision Language Model [56.09241449206817]
Understanding visual degradations is a critical yet challenging problem in computer vision. We introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations.
arXiv Detail & Related papers (2026-02-04T13:51:15Z)
Bridging Information Asymmetry: A Hierarchical Framework for Deterministic Blind Face Restoration [31.878334664450776]
We present Pref-Restore, a hierarchical framework that integrates discrete semantic logic with continuous texture generation. Our methodology fundamentally addresses this information disparity through two complementary strategies. Pref-Restore achieves state-of-the-art performance across synthetic and real-world benchmarks.
arXiv Detail & Related papers (2026-01-27T11:50:31Z)
LP-LLM: End-to-End Real-World Degraded License Plate Text Recognition via Large Multimodal Models [4.497411606350301]
Real-world License Plate Recognition (LPR) faces significant challenges from severe degradations such as motion blur, low resolution, and complex illumination. The prevailing "restoration-then-recognition" two-stage paradigm suffers from a fundamental flaw: the pixel-level optimization objectives of image restoration models are misaligned with the semantic goals of character recognition. We propose an end-to-end structure-aware multimodal reasoning framework based on Qwen3-VL.
arXiv Detail & Related papers (2026-01-14T03:32:55Z)
Mixture of Ranks with Degradation-Aware Routing for One-Step Real-World Image Super-Resolution [76.66229730098759]
In real-world image super-resolution (Real-ISR), existing approaches mainly rely on fine-tuning pre-trained diffusion models. We propose a Mixture-of-Ranks (MoR) architecture for single-step image super-resolution. We introduce a fine-grained expert partitioning strategy that treats each rank in LoRA as an independent expert.
arXiv Detail & Related papers (2025-11-20T04:11:44Z)
Unlocking the Essence of Beauty: Advanced Aesthetic Reasoning with Relative-Absolute Policy Optimization [63.169050703903515]
We propose Aes-R1, a comprehensive aesthetic reasoning framework with reinforcement learning (RL). Aes-R1 integrates a pipeline, AesCoT, to construct and filter high-quality chain-of-thought aesthetic reasoning data. Experiments demonstrate that Aes-R1 improves the backbone's average PLCC/SRCC by 47.9%/34.8%.
arXiv Detail & Related papers (2025-09-26T04:55:00Z)
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce Sycophancy Mitigation through Adaptive Reasoning Trajectories (SMART). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
Edge-Aware Normalized Attention for Efficient and Detail-Preserving Single Image Super-Resolution [27.3322913419539]
Single-image super-resolution (SISR) remains highly ill-posed because recovering structurally faithful high-frequency content from a single low-resolution observation is ambiguous. Existing edge-aware methods often attach edge priors or attention branches onto increasingly complex backbones, yet ad hoc fusion frequently introduces redundancy, unstable optimization, or limited structural gains. We address this gap with an edge-guided attention mechanism that derives an adaptive modulation map from jointly encoded edge features and intermediate feature activations, then applies it to normalize and reweight responses, selectively amplifying structurally salient regions while suppressing spurious textures.
arXiv Detail & Related papers (2025-09-18T02:31:24Z)
LORE: Lagrangian-Optimized Robust Embeddings for Visual Encoders [11.01163097340578]
We propose Lagrangian-Optimized Robust Embeddings (LORE), a novel unsupervised adversarial fine-tuning framework. LORE significantly improves zero-shot adversarial robustness with minimal degradation in clean data accuracy.
arXiv Detail & Related papers (2025-05-24T21:54:52Z)
Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
Unified Enhancement of the Generalization and Robustness of Language Models via Bi-Stage Optimization [2.502393972789905]
We propose a bi-stage optimization framework to uniformly enhance both the generalization and robustness of LMs. We show that our method significantly improves the generalization and robustness of LMs compared to other existing methods.
arXiv Detail & Related papers (2025-03-19T13:50:36Z)
Source-Free Domain Adaptive Object Detection with Semantics Compensation [54.00183496587841]
We introduce Weak-to-strong Semantics Compensation (WSCo) for strong data augmentation. WSCo compensates for the class-relevant semantics that may be lost during strong augmentation on the fly. WSCo can be implemented as a generic plug-in, easily integrable with any existing SFOD pipelines.
arXiv Detail & Related papers (2024-10-07T23:32:06Z)