Fugu-MT Paper Translation (Summary): Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment

Paper summary: Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment

arxiv url: http://arxiv.org/abs/2512.09573v1
Date: Wed, 10 Dec 2025 12:06:47 GMT
Status: Translation complete
Last updated in system: 2025-12-11 15:14:53.51039
Title: Investigate the Low-level Visual Perception in Vision-Language based Image Quality Assessment
Title (reference translation): A Study of Low-Level Visual Perception in Vision-Language-Based Image Quality Assessment
Authors: Yuan Li, Zitang Sun, Yen-Ju Chen, Shin'ya Nishida
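Abstract summary: We introduce a low-level distortion perception task that requires models to classify specific distortion types. Our analysis shows that MLLMs are structurally capable of representing such distortions but tend to overfit training templates. We show that improving the alignment of the vision encoder dramatically improves distortion recognition accuracy, from 14.92% to 84.43%.
Reference score (site-computed attention score): 7.969076042774561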
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to generate descriptive explanations. However, despite their strong visual perception modules, these models often fail to reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. This raises an essential question: do MLLM-based IQA systems truly perceive the visual features that matter? To examine this issue, we introduce a low-level distortion perception task that requires models to classify specific distortion types. Our component-wise analysis shows that although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, leading to biases in quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically enhances distortion recognition accuracy, increasing it from 14.92% to 84.43%. Overall, these findings indicate that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to produce more coherent and interpretable reasoning in vision-centric tasks.
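Abstract (reference translation): Recent advances in Image Quality Assessment (IQA) have leveraged Multi-modal Large Language Models (MLLMs) to produce descriptive explanations. However, despite their strong visual perception modules, these models cannot reliably detect basic low-level distortions such as blur, noise, and compression, and may produce inconsistent evaluations across repeated inferences. Do MLLM-based IQA systems truly perceive the visual features that matter? In this work, we propose a low-level distortion perception task that requires models to classify specific distortion types. Although MLLMs are structurally capable of representing such distortions, they tend to overfit training templates, which biases quality scoring. As a result, critical low-level features are weakened or lost during the vision-language alignment transfer stage. Furthermore, by computing the semantic distance between visual features and the corresponding semantic tokens before and after component-wise fine-tuning, we show that improving the alignment of the vision encoder dramatically improves distortion recognition accuracy, from 14.92% to 84.43%. These results suggest that incorporating dedicated constraints on the vision encoder can strengthen text-explainable visual representations and enable MLLM-based pipelines to achieve more coherent and interpretable reasoning in vision-centric tasks.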
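Note: the abstract mentions computing a semantic distance between visual features and the corresponding semantic tokens, but it does not say which metric is used. The following is a minimal sketch that assumes cosine distance and uses made-up 3-dimensional vectors; the function names, labels, and values are hypothetical and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def nearest_label(visual_feature, token_embeddings):
    """Return the label whose embedding is closest to the visual feature."""
    return min(
        token_embeddings,
        key=lambda label: cosine_distance(visual_feature, token_embeddings[label]),
    )


# Hypothetical toy embeddings for three distortion-type tokens (assumption).
token_embeddings = {
    "blur": [1.0, 0.0, 0.0],
    "noise": [0.0, 1.0, 0.0],
    "compression": [0.0, 0.0, 1.0],
}

# Hypothetical visual feature for one image (assumption).
visual_feature = [0.9, 0.1, 0.0]

predicted = nearest_label(visual_feature, token_embeddings)
print(predicted)
```

Under this assumption, a smaller distance between an image's visual feature and the token for its true distortion type would indicate better vision-language alignment for that distortion.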

Related papers