Towards Open-ended Visual Quality Comparison
- URL: http://arxiv.org/abs/2402.16641v2
- Date: Mon, 4 Mar 2024 07:06:22 GMT
- Title: Towards Open-ended Visual Quality Comparison
- Authors: Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen,
Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, Xiaohong Liu,
Guangtao Zhai, Shiqi Wang, and Weisi Lin
- Abstract summary: We extend the edge of emerging large multi-modality models (LMMs) to advance visual quality comparison into open-ended settings.
Co-Instruct is a first-of-its-kind open-source open-ended visual quality comparer.
We demonstrate that Co-Instruct achieves on average 30% higher accuracy than state-of-the-art open-source LMMs.
- Score: 87.45004129101089
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Comparative settings (e.g. pairwise choice, listwise ranking) have been
adopted by a wide range of subjective studies for image quality assessment
(IQA), as they inherently standardize the evaluation criteria across different
observers and offer more clear-cut responses. In this work, we extend the edge
of emerging large multi-modality models (LMMs) to further advance visual
quality comparison into open-ended settings that 1) can respond to open-range
questions on quality comparison and 2) can provide detailed reasoning beyond
direct answers. To this end, we propose Co-Instruct. To train this
first-of-its-kind open-source open-ended visual quality comparer, we collect
the Co-Instruct-562K dataset from two sources: (a) LLM-merged single-image
quality descriptions and (b) GPT-4V "teacher" responses on unlabeled data.
Furthermore, to better evaluate this setting, we propose MICBench, the
first benchmark on multi-image comparison for LMMs. We demonstrate that
Co-Instruct not only achieves on average 30% higher accuracy than
state-of-the-art open-source LMMs, but also outperforms GPT-4V (its teacher)
on both existing related benchmarks and the proposed MICBench. Our model is
published at https://huggingface.co/q-future/co-instruct.
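To make the open-ended pairwise setting concrete, the sketch below shows one way free-form pairwise answers from a quality comparer could be aggregated into a listwise ranking by simple win counting. The `ask_comparer` callable, the prompt wording, and the answer parsing are illustrative assumptions for this sketch, not the documented interface of the published Co-Instruct checkpoint.

```python
# Minimal sketch: turning open-ended pairwise quality choices into a
# listwise ranking. `ask_comparer` stands in for any open-ended visual
# quality comparer (e.g., an LMM asked "Which image has better quality,
# the first or the second? Explain why."); its interface is assumed here.
from collections import defaultdict
from itertools import combinations
from typing import Callable, List

def rank_by_pairwise_choices(
    images: List[str],
    ask_comparer: Callable[[str, str], str],
) -> List[str]:
    """Rank images by how often each one wins a pairwise comparison."""
    wins = defaultdict(int)
    for a, b in combinations(images, 2):
        answer = ask_comparer(a, b)             # free-form answer from the comparer
        winner = a if "first" in answer.lower() else b  # naive answer parsing
        wins[winner] += 1
    return sorted(images, key=lambda img: wins[img], reverse=True)

if __name__ == "__main__":
    # Toy stand-in for the LMM: pretends lexicographically smaller paths look better.
    def toy_comparer(img_a: str, img_b: str) -> str:
        if img_a < img_b:
            return "The first image has better quality."
        return "The second image has better quality."

    print(rank_by_pairwise_choices(["a.png", "c.png", "b.png"], toy_comparer))
```

Win counting is the crudest aggregation; a maximum a posteriori variant in the spirit of the 2AFC entry below is sketched after the related-papers list.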
Related papers
- Synthetic Multimodal Question Generation [60.33494376081317]
Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents.
We propose SMMQG, a synthetic data generation framework that generates question and answer pairs directly from multimodal documents.
We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it.
arXiv Detail & Related papers (2024-07-02T12:57:42Z)
- Selectively Answering Visual Questions [14.867972139262907]
Large multi-modal models (LMMs) have emerged with the capacity to perform vision tasks with unprecedented accuracy.
We perform the first in-depth analysis of calibration methods and metrics for visual question answering (VQA) with in-context learning LMMs.
We show that the likelihood score of visually grounded models is better calibrated than that of their text-only counterparts for in-context learning.
arXiv Detail & Related papers (2024-06-03T04:28:10Z)
- Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare [99.57567498494448]
We introduce Compare2Score, an all-around LMM-based no-reference IQA model.
During training, we generate scaled-up comparative instructions by comparing images from the same IQA dataset.
Experiments on nine IQA datasets validate that Compare2Score effectively bridges text-defined comparative levels during training.
arXiv Detail & Related papers (2024-05-29T17:26:09Z)
- Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and unintentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z)
- 2AFC Prompting of Large Multimodal Models for Image Quality Assessment [38.86162365208038]
Two-alternative forced choice (2AFC) prompting is widely regarded as the most reliable way of collecting human opinions of visual quality.
The global quality score of each image estimated by a particular LMM can be efficiently aggregated using maximum a posteriori estimation (a sketch of this aggregation idea appears after this list).
arXiv Detail & Related papers (2024-02-02T06:05:18Z)
- Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
arXiv Detail & Related papers (2023-12-28T16:10:25Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically.
We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs.
We find that our approaches achieve higher accuracy and align better with human judgments.
arXiv Detail & Related papers (2023-07-06T04:05:44Z)
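The 2AFC entry above leaves the aggregation step abstract. Below is a minimal sketch of maximum a posteriori scoring from pairwise preference counts, assuming a Bradley-Terry observation model, a Gaussian prior on the latent scores, and plain gradient ascent; the model, prior, and optimizer used in that paper may differ.

```python
# Minimal sketch: MAP estimation of global quality scores from pairwise
# preference counts, assuming a Bradley-Terry model with a Gaussian prior.
import numpy as np

def map_quality_scores(wins: np.ndarray, prior_weight: float = 0.1,
                       lr: float = 0.05, steps: int = 2000) -> np.ndarray:
    """wins[i, j] = how many times image i was preferred over image j."""
    n = wins.shape[0]
    s = np.zeros(n)                       # latent quality scores
    for _ in range(steps):
        diff = s[:, None] - s[None, :]    # s_i - s_j for every pair
        p = 1.0 / (1.0 + np.exp(-diff))   # P(i beats j) under Bradley-Terry
        # Gradient of the log-posterior: observed wins minus expected wins
        # per image, minus the pull of the Gaussian prior toward zero.
        expected = (wins + wins.T) * p
        grad = (wins - expected).sum(axis=1) - prior_weight * s
        s += lr * grad
    return s - s.mean()                   # center: only score differences matter

if __name__ == "__main__":
    # Toy preference matrix over three images: image 0 wins most comparisons.
    wins = np.array([[0, 8, 9],
                     [2, 0, 6],
                     [1, 4, 0]], dtype=float)
    print(np.round(map_quality_scores(wins), 3))
```

Centering the output only fixes the arbitrary zero point; under the Bradley-Terry model, score differences are what determine the predicted pairwise preferences.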