Fugu-MT Paper Translation (Abstract): CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

Paper overview: CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

arxiv url: http://arxiv.org/abs/2604.04780v1
Date: Mon, 06 Apr 2026 15:54:00 GMT
Status: Translation complete
System update date: 2026-04-07 15:49:19.258023
Title: CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
Title (reference translation): CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models
Authors: Xiangzhao Hao, Zefeng Zhang, Zhenyu Zhang, Linhao Yu, Yao Chen, Yiqian Zhang, Haiyun Guo, Shuohuan Wang, Yu Sun
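Abstract summary: Multimodal models that combine understanding and generation fail to leverage their own generative capacity on degraded inputs. This paper introduces CLEAR, a framework that connects the two capabilities through three progressive steps. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance.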
Reference score (independently computed attention score): 23.357627415320025
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Image degradation from blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models that combine understanding and generation within a single architecture are a natural fit for this challenge, as their generative pathway can model the fine-grained visual structure that degradation destroys. Yet these models fail to leverage their own generative capacity on degraded inputs. We trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. We present CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. We construct MMD-Bench, covering three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. Our analysis further reveals that removing pixel-level reconstruction supervision leads to intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.
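Abstract (reference translation): Image degradation caused by blur, noise, compression, and poor illumination severely undermines multimodal understanding in real-world settings. Unified multimodal models, which combine understanding and generation in a single architecture, are a natural fit for this challenge because their generative pathway can model the fine-grained visual structure that degradation destroys. However, these models fail to leverage their own generative capacity on degraded inputs. The authors trace this disconnect to two compounding factors: existing training regimes never ask the model to invoke generation during reasoning, and the standard decode-reencode pathway does not support effective joint optimization. The paper proposes CLEAR, a framework that connects the two capabilities through three progressive steps: (1) supervised fine-tuning on a degradation-aware dataset to establish the generate-then-answer reasoning pattern; (2) a Latent Representation Bridge that replaces the decode-reencode detour with a direct, optimizable connection between generation and reasoning; and (3) Interleaved GRPO, a reinforcement learning method that jointly optimizes text reasoning and visual generation under answer-correctness rewards. The authors construct MMD-Bench, which covers three degradation severity levels across six standard multimodal benchmarks. Experiments show that CLEAR substantially improves robustness on degraded inputs while preserving clean-image performance. The analysis further shows that removing pixel-level reconstruction supervision yields intermediate visual states with higher perceptual quality, suggesting that task-driven optimization and visual quality are naturally aligned.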
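The abstract names GRPO with answer-correctness rewards but gives no formulas, so the following is only a minimal sketch of the generic group-relative advantage used in standard GRPO, not the paper's Interleaved GRPO. The function names `correctness_reward` and `group_relative_advantages`, the exact-match reward, the use of the population standard deviation, and the sample answers are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def correctness_reward(predicted: str, reference: str) -> float:
    """Hypothetical reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group.

    Every rollout for the same prompt is scored, then centered on the group
    mean and scaled by the group standard deviation (population form here,
    which is an assumption; implementations differ).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical answers sampled for one degraded-image question.
samples = ["A red bus", "a blue car", "a red bus ", "a truck"]
rewards = [correctness_reward(s, "a red bus") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, and a group whose answers are all correct or all wrong yields zero advantages, so it contributes no learning signal.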

Related papers