Fugu-MT paper translation (abstract): LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

Paper abstract: LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models

arxiv url: http://arxiv.org/abs/2606.02535v1
Date: Mon, 01 Jun 2026 17:40:09 GMT
Status: translation complete
Last updated in system: 2026-06-02 21:34:32.546026
Title: LL-Bench: Rethinking Low-Level Vision Evaluation in the Era of Large-Scale Generative Models
Title (reference translation): LL-Bench: Rethinking low-level vision evaluation in the era of large-scale generative models
Authors: Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu, Haoyun Jiang, Liu Yang, Qiang Hu, Guangtao Zhai, Xiaoyun Zhang
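Abstract summary: LL-Bench is a benchmark for evaluating the capabilities of large-scale generative models on low-level vision tasks. The paper presents a systematic diagnosis that reveals the performance boundaries and unique failure modes of large-scale generative models. It also proposes LL-Score, an MLLM-based evaluator.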
Reference score (independently computed attention score): 51.91197140548632
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale generative models have demonstrated remarkable capabilities across image generation and editing tasks. However, their performance in low-level vision tasks, which require pixel-wise control, remains insufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive benchmark for evaluating the capabilities of large-scale generative models on low-level vision tasks. The benchmark comprises 2,469 real-world degraded images covering 16 low-level degradation tasks, and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models, which are annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Built upon LL-Bench, we present a systematic diagnosis that reveals the performance boundaries and unique failure modes of large-scale generative models across diverse low-level vision tasks, compared with conventional representative restoration approaches. Moreover, we investigate the effectiveness of current quality evaluation metrics on LL-Bench, which exhibit significant discrepancy with human ratings. To better align restored-image quality assessment with human preferences, we further propose LL-Score, an MLLM-based evaluator that captures both restoration quality and hallucination existence. Extensive experiments demonstrate that LL-Score not only outperforms existing image quality assessment metrics, but also serves as a promising reward model for training generative models on low-level vision tasks.
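Abstract (reference translation): Large-scale generative models have shown remarkable capabilities across image generation and editing tasks. However, their performance on low-level vision tasks, which require pixel-wise control, has not yet been sufficiently studied. To address this gap, we introduce LL-Bench, a comprehensive benchmark for evaluating the capabilities of large-scale generative models. The benchmark consists of 2,469 real-world degraded images covering 16 low-level degradation tasks and 28,919 restored images produced by 10 state-of-the-art large-scale generative models and 21 conventional restoration models, annotated with 152,020 expert-level pairwise human preferences and 28,334 quality scores. Building on LL-Bench, we reveal the performance boundaries and unique failure modes of large-scale generative models across diverse low-level vision tasks. We also examine the effectiveness of evaluation metrics on LL-Bench. To better align restored-image quality assessment with human preferences, we propose LL-Score, an MLLM-based evaluator. Extensive experiments show that LL-Score not only outperforms existing image quality assessment metrics but also serves as a promising reward model for training generative models on low-level vision tasks.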

Related papers