Fugu-MT 論文翻訳(概要): Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

論文の概要: Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

arxiv url: http://arxiv.org/abs/2309.14181v3
Date: Mon, 1 Jan 2024 14:48:48 GMT
ステータス: 翻訳完了
システム内更新日: 2024-01-03 01:47:26.327957
Title: Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
Title（参考訳）: Q-Bench: 低レベルのビジョンに基づく汎用基盤モデルのベンチマーク
Authors: Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, Weisi Lin
Abstract要約: MLLM(Multi-modality Large Language Models)は、コンピュータビジョンの特殊モデルから汎用基礎モデルへのシフトを触媒している。 Q-Benchは3つの領域(低レベル視覚知覚、低レベル視覚記述、全体視品質評価)でMLLMの潜在能力を評価するための総合的なベンチマークである。
参考スコア（独自算出の注目度）: 85.6008224440157
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, there is still an inadequacy in assessing the abilities of MLLMs on low-level visual perception and understanding. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate potential abilities of MLLMs on three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 diverse-sourced images, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs on answering these questions. b) To examine the description ability of MLLMs on low-level information, we propose the LLDescribe dataset consisting of long expert-labelled golden low-level text descriptions on 499 images, and a GPT-involved comparison pipeline between outputs of MLLMs and the golden descriptions. c) Besides these two tasks, we further measure their visual quality assessment ability to align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements on MLLMs towards these abilities. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://q-future.github.io/Q-Bench.
Abstract（参考訳）: MLLM(Multi-modality Large Language Models)の急速な進化は、コンピュータビジョンの特殊モデルから汎用基礎モデルへのシフトを引き起こした。それでも、低レベルの視覚知覚と理解においてMLLMの能力を評価するにはまだ不十分である。このギャップに対処するために、我々は3つの領域(低レベル視覚知覚、低レベル視覚記述、全体視覚品質評価)でMLLMの潜在能力を体系的に評価する総合的なベンチマークであるQ-Benchを紹介する。 a) 低レベルの知覚能力を評価するために,2,990個の多様なソース画像からなるLLVisionQAデータセットを構築し,その低レベルの属性に着目した人間に質問する。次に,これらの質問に対するMLLMの正当性を測定した。 b) MLLMの低レベル情報に基づく記述能力を検討するため, 499 画像上の長大な専門家による黄金の低レベルテキスト記述からなるLLDescribeデータセットと, MLLMの出力と黄金の記述との GPT による比較パイプラインを提案する。 c) この2つの課題に加えて, 人間の意見スコアに合わせる視覚的品質評価能力も測定した。具体的には、MLLMが定量品質スコアを予測できるソフトマックスベースの戦略を設計し、既存の画像品質評価(IQA)データセットで評価する。評価の結果,MLLMは低レベルの視覚能力を有することが明らかとなった。しかし、これらのスキルはまだ不安定で比較的不正確であり、これらの能力に対するMLLMの具体的な強化の必要性を示している。私たちのベンチマークは、MLLMの未解決の可能性を発見し、強化するために、研究コミュニティをより深く掘り下げることを奨励するものです。プロジェクトページ: https://q-future.github.io/q-bench。

論文の概要: Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

関連論文リスト