Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment
- URL: http://arxiv.org/abs/2411.17237v1
- Date: Tue, 26 Nov 2024 09:03:16 GMT
- Title: Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment
- Authors: Zheng Chen, Xun Zhang, Wenbo Li, Renjing Pei, Fenglong Song, Xiongkuo Min, Xiaohong Liu, Xin Yuan, Yong Guo, Yulun Zhang
- Abstract summary: We introduce a new image quality assessment (IQA) task paradigm, grounding-IQA.
Grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA).
To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline.
Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate more fine-grained IQA applications.
- Score: 69.07445098168344
- License:
- Abstract: The development of multimodal large language models (MLLMs) enables the evaluation of image quality through natural language descriptions. This advancement allows for more detailed assessments. However, these MLLM-based IQA methods primarily rely on general contextual descriptions, sometimes limiting fine-grained quality assessment. To address this limitation, we introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception. Specifically, grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). GIQA-DES involves detailed descriptions with precise locations (e.g., bounding boxes), while GIQA-VQA focuses on quality QA for local regions. To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Furthermore, we develop a well-designed benchmark, GIQA-Bench. The benchmark comprehensively evaluates model grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate more fine-grained IQA applications. Code: https://github.com/zhengchen1999/Grounding-IQA.
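To make the two subtasks concrete, the following is a minimal sketch of how a GIQA-160K-style record might be represented: a GIQA-DES entry whose description is tied to bounding boxes, and a GIQA-VQA entry asking about a specific local region. The field names, dataclass layout, and (x1, y1, x2, y2) pixel-coordinate convention are illustrative assumptions, not the released dataset schema; see the repository linked above for the actual format.

```python
# Hypothetical illustration of grounding-IQA annotations. Field names and the
# (x1, y1, x2, y2) pixel-box convention are assumptions, not the GIQA-160K schema.
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels (assumed convention)


@dataclass
class GIQADesSample:
    """GIQA-DES: a detailed quality description grounded in image regions."""
    image_path: str
    description: str            # quality description referencing specific regions
    grounded_boxes: List[BBox]  # one box per referenced region, in reading order


@dataclass
class GIQAVqaSample:
    """GIQA-VQA: a quality question and answer about a local region."""
    image_path: str
    question: str
    answer: str
    region: BBox                # region the question refers to


# Example records (contents invented purely for illustration).
des_sample = GIQADesSample(
    image_path="images/0001.jpg",
    description=(
        "The face in the foreground is sharp, but the sign in the "
        "upper-right corner shows strong motion blur."
    ),
    grounded_boxes=[(120, 80, 360, 400), (540, 20, 760, 180)],
)

vqa_sample = GIQAVqaSample(
    image_path="images/0001.jpg",
    question="Is the region in the upper-right corner affected by motion blur?",
    answer="Yes, the text on the sign is smeared along the horizontal direction.",
    region=(540, 20, 760, 180),
)

if __name__ == "__main__":
    print(des_sample)
    print(vqa_sample)
```

Under this kind of representation, the grounding-precision perspective of GIQA-Bench could plausibly be scored by matching a model's predicted boxes against `grounded_boxes` (e.g., via IoU), while description quality and VQA accuracy are judged on the text fields; the abstract names these three perspectives but does not fix the exact metrics here.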
Related papers
- Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment [57.10083003305353]
We propose Dog-IQA, a standard-guided zero-shot mix-grained IQA method, which is training-free and utilizes the exceptional prior knowledge of multimodal large language models (MLLMs).
Dog-IQA scores objectively against specific standards that exploit the MLLM's behavior patterns and minimize the influence of subjective factors.
Our proposed Dog-IQA achieves state-of-the-art (SOTA) performance compared with training-free methods, and competitive performance compared with training-based methods in cross-dataset scenarios.
arXiv Detail & Related papers (2024-10-03T14:14:21Z) - Boosting CLIP Adaptation for Image Quality Assessment via Meta-Prompt Learning and Gradient Regularization [55.09893295671917]
This paper introduces a novel Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA)
The GRMP-IQA comprises two key modules: Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization.
Experiments on five standard BIQA datasets demonstrate superior performance over state-of-the-art BIQA methods under the limited-data setting.
arXiv Detail & Related papers (2024-09-09T07:26:21Z) - Bringing Textual Prompt to AI-Generated Image Quality Assessment [4.230780744307392]
IP-IQA (AGIs Quality Assessment via Image and Prompt) is a multimodal framework for AGIQA that incorporates both the generated image and its corresponding prompt.
An effective and efficient image-prompt fusion module, along with a novel special [QA] token, is also applied.
Experiments demonstrate that our IP-IQA achieves state-of-the-art performance on the AGIQA-1k and AGIQA-3k datasets.
arXiv Detail & Related papers (2024-03-27T16:02:00Z) - Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models [28.194638379354252]
We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods.
DepictQA allows for detailed, language-based, human-like evaluation of image quality by leveraging Multi-modal Large Language Models.
These results showcase the research potential of multi-modal IQA methods.
arXiv Detail & Related papers (2023-12-14T14:10:02Z) - NuScenes-MQA: Integrated Evaluation of Captions and QA for Autonomous Driving Datasets using Markup Annotations [0.6827423171182154]
Visual Question Answering (VQA) is one of the most important tasks in autonomous driving.
We introduce a novel dataset annotation technique in which QAs are enclosed within markups.
This dataset empowers the development of vision language models, especially for autonomous driving tasks.
arXiv Detail & Related papers (2023-12-11T12:58:54Z) - Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information.
We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z) - RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering [87.18962441714976]
We introduce RoMQA, the first benchmark for robust, multi-evidence, multi-answer question answering (QA).
We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings.
Our results show that RoMQA is a challenging benchmark for large language models and provides a quantifiable test for building more robust QA methods.
arXiv Detail & Related papers (2022-10-25T21:39:36Z) - Generating Diverse and Consistent QA pairs from Contexts with Information-Maximizing Hierarchical Conditional VAEs [62.71505254770827]
We propose a hierarchical conditional variational autoencoder (HCVAE) for generating QA pairs given unstructured texts as contexts.
Our model obtains impressive performance gains over all baselines on both tasks, using only a fraction of the data for training.
arXiv Detail & Related papers (2020-05-28T08:26:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.