Towards Open-ended Visual Quality Comparison
- URL: http://arxiv.org/abs/2402.16641v2
- Date: Mon, 4 Mar 2024 07:06:22 GMT
- Title: Towards Open-ended Visual Quality Comparison
- Authors: Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen,
Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, Xiaohong Liu,
Guangtao Zhai, Shiqi Wang, and Weisi Lin
- Abstract summary: We extend the edge of emerging large multi-modality models (LMMs) to advance visual quality comparison into open-ended settings.
Co-Instruct is a first-of-its-kind open-source open-ended visual quality comparer.
We demonstrate that Co-Instruct achieves on average 30% higher accuracy than state-of-the-art open-source LMMs.
- Score: 87.45004129101089
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Comparative settings (e.g. pairwise choice, listwise ranking) have been
adopted by a wide range of subjective studies for image quality assessment
(IQA), as they inherently standardize the evaluation criteria across different
observers and offer more clear-cut responses. In this work, we extend the edge
of emerging large multi-modality models (LMMs) to further advance visual
quality comparison into open-ended settings that 1) can respond to open-range
questions on quality comparison and 2) can provide detailed reasoning beyond
direct answers. To this end, we propose Co-Instruct. To train this
first-of-its-kind open-source open-ended visual quality comparer, we collect
the Co-Instruct-562K dataset from two sources: (a) LLM-merged single-image
quality descriptions and (b) GPT-4V "teacher" responses on unlabeled data.
Furthermore, to better evaluate this setting, we propose MICBench, the
first benchmark on multi-image comparison for LMMs. We demonstrate that
Co-Instruct not only achieves on average 30% higher accuracy than
state-of-the-art open-source LMMs, but also outperforms GPT-4V (its teacher)
on both existing related benchmarks and the proposed MICBench. Our model is
published at https://huggingface.co/q-future/co-instruct.
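To make the open-ended pairwise setting concrete, the sketch below shows one way free-form pairwise answers from a quality comparer could be aggregated into a listwise ranking by simple win counting. The `ask_comparer` callable, the prompt wording, and the answer parsing are illustrative assumptions for this sketch, not the documented interface of the published Co-Instruct checkpoint.

```python
# Minimal sketch: turning open-ended pairwise quality choices into a
# listwise ranking. `ask_comparer` stands in for any open-ended visual
# quality comparer (e.g., an LMM asked "Which image has better quality,
# the first or the second? Explain why."); its interface is assumed here.
from collections import defaultdict
from itertools import combinations
from typing import Callable, List

def rank_by_pairwise_choices(
    images: List[str],
    ask_comparer: Callable[[str, str], str],
) -> List[str]:
    """Rank images by how often each one wins a pairwise comparison."""
    wins = defaultdict(int)
    for a, b in combinations(images, 2):
        answer = ask_comparer(a, b)             # free-form answer from the comparer
        winner = a if "first" in answer.lower() else b  # naive answer parsing
        wins[winner] += 1
    return sorted(images, key=lambda img: wins[img], reverse=True)

if __name__ == "__main__":
    # Toy stand-in for the LMM: pretends lexicographically smaller paths look better.
    def toy_comparer(img_a: str, img_b: str) -> str:
        if img_a < img_b:
            return "The first image has better quality."
        return "The second image has better quality."

    print(rank_by_pairwise_choices(["a.png", "c.png", "b.png"], toy_comparer))
```

Win counting is the crudest aggregation; a maximum a posteriori variant in the spirit of the 2AFC entry below is sketched after the related-papers list.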
Related papers
- Synthetic Multimodal Question Generation [60.33494376081317]
Multimodal Retrieval Augmented Generation (MMRAG) is a powerful approach to question-answering over multimodal documents.
We propose SMMQG, a synthetic data generation framework that generates question and answer pairs directly from multimodal documents.
We use SMMQG to generate an MMRAG dataset of 1024 questions over Wikipedia documents and evaluate state-of-the-art models using it.
arXiv Detail & Related papers (2024-07-02T12:57:42Z)
- Selectively Answering Visual Questions [14.867972139262907]
Large multi-modal models (LMMs) have emerged with the capacity to perform vision tasks with unprecedented accuracy.
We perform the first in-depth analysis of calibration methods and metrics for visual question answering (VQA) with in-context learning LMMs.
We show that the likelihood score of visually grounded models is better calibrated than that of their text-only counterparts for in-context learning.
arXiv Detail & Related papers (2024-06-03T04:28:10Z)
- Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare [99.57567498494448]
We introduce Compare2Score, an all-around LMM-based no-reference IQA model.
During training, we generate scaled-up comparative instructions by comparing images from the same IQA dataset.
Experiments on nine IQA datasets validate that Compare2Score effectively bridges text-defined comparative levels during training.
arXiv Detail & Related papers (2024-05-29T17:26:09Z)
- Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and unintentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z)
- 2AFC Prompting of Large Multimodal Models for Image Quality Assessment [38.86162365208038]
Two-alternative forced choice (2AFC) prompting is widely regarded as the most reliable way of collecting human opinions of visual quality.
The global quality score of each image estimated by a particular LMM can be efficiently aggregated using maximum a posteriori estimation (a sketch of this aggregation idea appears after this list).
arXiv Detail & Related papers (2024-02-02T06:05:18Z)
- Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
arXiv Detail & Related papers (2023-12-28T16:10:25Z)
- MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z)
- PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations [10.709365940160685]
Modern large language models (LLMs) are hard to evaluate and compare automatically.
We propose a peer rank (PR) algorithm that takes into account each peer LLM's pairwise preferences of all answer pairs.
We find that our approaches achieve higher accuracy and align better with human judgments.
arXiv Detail & Related papers (2023-07-06T04:05:44Z)
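The 2AFC entry above leaves the aggregation step abstract. Below is a minimal sketch of maximum a posteriori scoring from pairwise preference counts, assuming a Bradley-Terry observation model, a Gaussian prior on the latent scores, and plain gradient ascent; the model, prior, and optimizer used in that paper may differ.

```python
# Minimal sketch: MAP estimation of global quality scores from pairwise
# preference counts, assuming a Bradley-Terry model with a Gaussian prior.
import numpy as np

def map_quality_scores(wins: np.ndarray, prior_weight: float = 0.1,
                       lr: float = 0.05, steps: int = 2000) -> np.ndarray:
    """wins[i, j] = how many times image i was preferred over image j."""
    n = wins.shape[0]
    s = np.zeros(n)                       # latent quality scores
    for _ in range(steps):
        diff = s[:, None] - s[None, :]    # s_i - s_j for every pair
        p = 1.0 / (1.0 + np.exp(-diff))   # P(i beats j) under Bradley-Terry
        # Gradient of the log-posterior: observed wins minus expected wins
        # per image, minus the pull of the Gaussian prior toward zero.
        expected = (wins + wins.T) * p
        grad = (wins - expected).sum(axis=1) - prior_weight * s
        s += lr * grad
    return s - s.mean()                   # center: only score differences matter

if __name__ == "__main__":
    # Toy preference matrix over three images: image 0 wins most comparisons.
    wins = np.array([[0, 8, 9],
                     [2, 0, 6],
                     [1, 4, 0]], dtype=float)
    print(np.round(map_quality_scores(wins), 3))
```

Centering the output only fixes the arbitrary zero point; under the Bradley-Terry model, score differences are what determine the predicted pairwise preferences.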