Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment
- URL: http://arxiv.org/abs/2406.09858v2
- Date: Fri, 21 Jun 2024 04:45:04 GMT
- Title: Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment
- Authors: Fei Zhou, Zhicong Huang, Tianhao Gu, Guoping Qiu
- Abstract summary: Distilling high-level knowledge about quality-bearing attributes is crucial for developing objective Image Quality Assessment (IQA) methods.
We present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE).
SLIQUE features a joint vision-language and visual contrastive representation learning framework for acquiring high-level knowledge about the images' semantic contents, distortion characteristics and appearance properties for IQA.
- Score: 20.851102845794244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The visual quality of an image is confounded by a number of intertwined factors including its semantic content, distortion characteristics and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high-level knowledge about all these quality-bearing attributes is crucial for developing objective Image Quality Assessment (IQA). While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality-related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high-level knowledge about the images' semantic contents, distortion characteristics and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first-of-its-kind large image database annotated with all three categories of quality-relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE has superior performance over the state of the art, demonstrating the soundness of its design principle and the effectiveness of its implementation.
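The abstract describes SLIQUE's training objective only at a high level. As a rough illustration, the sketch below combines an image-text InfoNCE term (vision-language supervision) with an image-image term over augmented views (self-supervision) in PyTorch; the embedding dimension, the `lam` weighting and the random stand-in embeddings are assumptions for illustration, not details from the paper.

```python
# A minimal sketch, not SLIQUE's actual implementation: a joint objective with
# an image-text contrastive term (vision-language supervision) and an
# image-image term over augmented views (self-supervision).
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of aligned embeddings (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (N, N) similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives on diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def joint_loss(img_emb, txt_emb, img_emb_aug, lam: float = 0.5) -> torch.Tensor:
    """Weighted sum of the vision-language and purely visual contrastive terms.
    The equal weighting is an assumption for illustration."""
    return lam * info_nce(img_emb, txt_emb) + (1.0 - lam) * info_nce(img_emb, img_emb_aug)

# Toy usage: random tensors stand in for encoder outputs.
N, D = 8, 512
print(joint_loss(torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)).item())
```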
Related papers
- ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling [32.55352435358949]
We propose a sentence generation-based retrieval formulation for attribute recognition.
For each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence.
We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets.
arXiv Detail & Related papers (2024-08-07T21:44:29Z)
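ArtVLM's summary describes ranking attributes by the visual-conditioned probability of generating a short sentence. The sketch below shows that scoring loop in Python; `caption_log_prob`, the sentence template and the toy scorer are hypothetical stand-ins for whatever captioning VLM is used, not an API from the paper.

```python
# A hypothetical sketch of generative retrieval for attribute recognition:
# rank attributes by the log-probability of generating a short sentence
# conditioned on the image. `caption_log_prob` is a stand-in hook for any
# captioning VLM; it is not an API from the paper.
from typing import Callable, List, Tuple

def rank_attributes(
    image: object,
    attributes: List[str],
    caption_log_prob: Callable[[object, str], float],
) -> List[Tuple[str, float]]:
    """Sort attributes by visual-conditioned sentence log-probability (descending)."""
    scored = []
    for attr in attributes:
        sentence = f"a photo of a {attr} object."  # assumed prompt template
        scored.append((attr, caption_log_prob(image, sentence)))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)

# Toy scorer so the sketch runs end to end; a real scorer would sum the
# token log-probabilities returned by the captioning model.
def toy_scorer(image: object, sentence: str) -> float:
    return -0.1 * len(sentence)

print(rank_attributes(None, ["shiny", "wooden", "blurry"], toy_scorer))
```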
- Q-Ground: Image Quality Grounding with Large Multi-modality Models [61.72022069880346]
We introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding.
Q-Ground combines large multi-modality models with detailed visual quality analysis.
Central to our contribution is the introduction of the QGround-100K dataset.
arXiv Detail & Related papers (2024-07-24T06:42:46Z)
- Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment [57.07360640784803]
We propose vision-language consistency guided multi-modal prompt learning for blind AI-generated image quality assessment (AGIQA).
Specifically, we introduce learnable textual and visual prompts in language and vision branches of Contrastive Language-Image Pre-training (CLIP) models.
We design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts.
arXiv Detail & Related papers (2024-06-24T13:45:31Z)
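The prompt-learning idea above (learnable textual and visual prompts in CLIP's branches, revisited by the Multi-Modal Prompt Learning paper further down) can be sketched as CoOp-style context vectors prepended to class-name embeddings. The dimensions and the random class embeddings below are stand-ins; the actual papers' prompt designs may differ.

```python
# A CoOp-style sketch of learnable textual prompts: trainable context vectors
# are prepended to class-name embeddings and would be fed to a frozen CLIP
# text encoder. Dimensions and embeddings are stand-ins, not the paper's design.
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    def __init__(self, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        # Context vectors optimized by backprop while CLIP itself stays frozen.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_emb: torch.Tensor) -> torch.Tensor:
        """class_emb: (n_classes, dim) -> (n_classes, n_ctx + 1, dim)."""
        ctx = self.ctx.unsqueeze(0).expand(class_emb.size(0), -1, -1)
        return torch.cat([ctx, class_emb.unsqueeze(1)], dim=1)

prompt = LearnablePrompt()
class_emb = torch.randn(5, 512)  # e.g., embeddings of five quality levels
print(prompt(class_emb).shape)   # torch.Size([5, 9, 512])
```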
- UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment [23.48816491333345]
Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal.
Existing methods typically address these tasks independently due to distinct learning objectives.
We propose Unified vision-language pre-training of Quality and Aesthetics (UniQA) to learn general perceptions of two tasks, thereby benefiting them simultaneously.
arXiv Detail & Related papers (2024-06-03T07:40:10Z)
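UniQA's summary suggests one model serving both IQA and IAA. A minimal way to read that is a shared backbone with two heads, sketched below; the stub backbone and MSE losses are assumptions, since UniQA itself relies on vision-language pre-training rather than this toy setup.

```python
# A minimal sketch, assuming the simplest reading of "benefiting both tasks":
# one shared backbone with separate quality and aesthetics heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPerceptionModel(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.quality_head = nn.Linear(feat_dim, 1)    # technical quality score
        self.aesthetic_head = nn.Linear(feat_dim, 1)  # aesthetic appeal score

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        return self.quality_head(feats), self.aesthetic_head(feats)

model = SharedPerceptionModel()
q, a = model(torch.randn(4, 3, 32, 32))  # toy image batch
loss = F.mse_loss(q, torch.rand(4, 1)) + F.mse_loss(a, torch.rand(4, 1))
print(q.shape, a.shape, round(loss.item(), 3))
```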
- Large Multi-modality Model Assisted AI-Generated Image Quality Assessment [53.182136445844904]
We introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model.
It uses semantically informed guidance, extracting semantic vectors through carefully designed text prompts.
It achieves state-of-the-art performance, and demonstrates its superior generalization capabilities on assessing the quality of AI-generated images.
arXiv Detail & Related papers (2024-04-27T02:40:36Z)
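MA-AGIQA's summary mentions extracting semantic vectors via text prompts and using them as guidance. The sketch below shows one plausible fusion of such a semantic vector with quality features through a learned gate; the gating design is an assumption, not the paper's module.

```python
# A hypothetical fusion module: blend an LMM-derived semantic vector with
# quality features through a learned gate, then regress a score. The gating
# design is an assumption; MA-AGIQA's actual fusion may differ.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.score = nn.Linear(dim, 1)

    def forward(self, quality_feat: torch.Tensor, semantic_vec: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([quality_feat, semantic_vec], dim=-1))
        fused = g * quality_feat + (1.0 - g) * semantic_vec  # learned blend
        return self.score(fused)

fusion = GatedFusion()
print(fusion(torch.randn(2, 512), torch.randn(2, 512)))  # predicted scores
```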
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address the resulting mismatch between generic pre-training and the IQA task using prompt techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models [23.99102775778499]
This paper introduces IQAGPT, an innovative image quality assessment system integrating an image quality captioning VLM with ChatGPT.
We build a CT-IQA dataset for training and evaluation, comprising 1,000 professionally annotated CT slices with diverse quality levels.
To better leverage the capabilities of LLMs, we convert annotated quality scores into semantically rich text descriptions using a prompt template.
arXiv Detail & Related papers (2023-12-25T09:13:18Z)
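IQAGPT's summary states that annotated quality scores are converted into semantically rich text via a prompt template. Below is a minimal sketch of such a mapping; the five level names, the 1-5 score range and the wording are illustrative assumptions, not the paper's actual template.

```python
# A minimal sketch of a score-to-text prompt template. The level names,
# score range and wording are assumptions, not IQAGPT's template.
def score_to_description(score: float, lo: float = 1.0, hi: float = 5.0) -> str:
    levels = ["bad", "poor", "fair", "good", "excellent"]  # assumed 5-level scale
    idx = max(0, min(int((score - lo) / (hi - lo) * len(levels)), len(levels) - 1))
    return (f"The quality of this CT image is {levels[idx]} "
            f"(score {score:.1f} on a {lo:g}-{hi:g} scale).")

for s in (1.2, 3.0, 4.8):
    print(score_to_description(s))
```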
- Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information.
We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
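This last paper evaluates CLIP zero-shot for "look" and "feel". The sketch below scores an image with an antonym prompt pair using the open-source `clip` package; the exact prompts and `example.jpg` are placeholders, and the paper's full protocol covers many more attribute pairs than this single example.

```python
# A sketch of zero-shot quality scoring with an antonym prompt pair, using the
# open-source `clip` package (pip install git+https://github.com/openai/CLIP.git).
# The prompts and "example.jpg" are placeholders; weights download on first run.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = clip.tokenize(["Good photo.", "Bad photo."]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(prompts)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_f @ txt_f.t()).softmax(dim=-1)

# Probability mass on "Good photo." serves as the zero-shot quality score.
print(f"zero-shot quality score: {probs[0, 0].item():.3f}")
```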