Fugu-MT Paper Translation (Abstract): Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

Paper summary: Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

arxiv url: http://arxiv.org/abs/2509.25787v2
Date: Sat, 04 Oct 2025 03:01:05 GMT
Status: Translation complete
Last updated in system: 2025-10-07 12:09:05.130518
Title: Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
Authors: Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang
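Abstract summary: EvoQuality is a novel framework that enables a vision-language model (VLM) to autonomously refine its quality perception capabilities. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. It boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks.
Reference score (attention score computed independently by this site): 22.2866006389482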
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques such as self-consistency have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks.
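The voting-and-reward loop described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the function names are hypothetical, the fidelity form sqrt(p*q) + sqrt((1-p)*(1-q)) is the fidelity measure commonly used in learning-to-rank rather than a confirmed detail of EvoQuality, and the quality scores are made-up numbers standing in for sampled VLM outputs. No model is called.

```python
import math
from statistics import mean, pstdev

def pairwise_vote_share(scores_a, scores_b):
    """Fraction of sampled (a, b) score pairs in which image A beats image B.
    Ties count as half a vote. Inputs stand in for sampled VLM quality scores."""
    votes = 0.0
    for sa in scores_a:
        for sb in scores_b:
            if sa > sb:
                votes += 1.0
            elif sa == sb:
                votes += 0.5
    return votes / (len(scores_a) * len(scores_b))

def pseudo_label(vote_share):
    """Majority vote: 1.0 if A is preferred, 0.0 if B is preferred, 0.5 on a tie."""
    if vote_share > 0.5:
        return 1.0
    if vote_share < 0.5:
        return 0.0
    return 0.5

def fidelity_reward(p_pred, p_pseudo):
    """Assumed fidelity form: equals 1 when the predicted preference
    probability matches the pseudo-label exactly, and decreases otherwise."""
    return math.sqrt(p_pred * p_pseudo) + math.sqrt((1.0 - p_pred) * (1.0 - p_pseudo))

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Made-up sampled scores for two images (illustrative only).
scores_a = [3.0, 3.5, 2.5, 4.0]
scores_b = [2.0, 3.0, 3.2, 2.8]

share = pairwise_vote_share(scores_a, scores_b)
label = pseudo_label(share)

# Hypothetical per-response preference probabilities for "A is better than B".
predicted = [1.0, 0.25, 0.0]
rewards = [fidelity_reward(p, label) for p in predicted]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that agree with the voted consensus receive a positive group-relative advantage and those that disagree receive a negative one, which is the signal a GRPO-style update would then use.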
