Related papers: UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment

UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment

URL: http://arxiv.org/abs/2406.01069v1
Date: Mon, 3 Jun 2024 07:40:10 GMT
Title: UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment
Authors: Hantao Zhou, Longxiang Tang, Rui Yang, Guanyi Qin, Yan Zhang, Runze Hu, Xiu Li,
Abstract summary: Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Existing methods typically address these tasks independently due to distinct learning objectives. We propose Unified vision-language pre-training of Quality and Aesthetics (UniQA) to learn general perceptions of two tasks, thereby benefiting them simultaneously.
Score: 23.48816491333345
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image Quality Assessment (IQA) and Image Aesthetic Assessment (IAA) aim to simulate human subjective perception of image visual quality and aesthetic appeal. Existing methods typically address these tasks independently due to distinct learning objectives. However, they neglect the underlying interconnectedness of both tasks, which hinders the learning of task-agnostic shared representations for human subjective perception. To confront this challenge, we propose Unified vision-language pre-training of Quality and Aesthetics (UniQA), to learn general perceptions of two tasks, thereby benefiting them simultaneously. Addressing the absence of text in the IQA datasets and the presence of textual noise in the IAA datasets, (1) we utilize multimodal large language models (MLLMs) to generate high-quality text descriptions; (2) the generated text for IAA serves as metadata to purify noisy IAA data. To effectively adapt the pre-trained UniQA to downstream tasks, we further propose a lightweight adapter that utilizes versatile cues to fully exploit the extensive knowledge of the pre-trained model. Extensive experiments demonstrate that our approach attains a new state-of-the-art performance on both IQA and IAA tasks, while concurrently showcasing exceptional zero-shot and few-label image assessment capabilities. The source code will be available at https://github.com/zht8506/UniQA.

Related papers

TRIQA: Image Quality Assessment by Contrastive Pretraining on Ordered Distortion Triplets [31.2422359004089]
No-Reference (NR) IQA remains particularly challenging due to the absence of a reference image.<n>We propose a novel approach that constructs a custom dataset using a limited number of reference content images.<n>We train a quality-aware model using contrastive triplet-based learning, enabling efficient training with fewer samples.
arXiv Detail & Related papers (2025-07-16T23:43:12Z)
Teaching LMMs for Image Quality Scoring and Interpreting [71.1335005098584]
We propose Q-SiT (Quality Scoring and Interpreting joint Teaching), a unified framework that enables image quality scoring and interpreting simultaneously. Q-SiT is the first model capable of simultaneously performing image quality scoring and interpreting tasks, along with its lightweight variant, Q-SiT-mini. Experimental results demonstrate that Q-SiT achieves strong performance in both tasks with superior generalization IQA abilities.
arXiv Detail & Related papers (2025-03-12T09:39:33Z)
Next Token Is Enough: Realistic Image Quality and Aesthetic Scoring with Multimodal Large Language Model [19.2881640541533]
Large language models (MLLMs) have shown great potential in image quality assessment (IQA) and image aesthetic assessment (IAA)<n>Here, we introduce a novel dataset, named Realistic image Quality and Aesthetic (RealQA)<n>These attributes span three levels: low level (e.g., image clarity), middle level (e.g., subject integrity) and high level (e.g. composition)<n>Surprisingly, by predicting just two extra significant digits, the next token paradigm can achieve SOTA performance.
arXiv Detail & Related papers (2025-03-08T09:49:10Z)
AI-generated Image Quality Assessment in Visual Communication [72.11144790293086]
AIGI-VC is a quality assessment database for AI-generated images in visual communication. The dataset consists of 2,500 images spanning 14 advertisement topics and 8 emotion types. It provides coarse-grained human preference annotations and fine-grained preference descriptions, benchmarking the abilities of IQA methods in preference prediction, interpretation, and reasoning.
arXiv Detail & Related papers (2024-12-20T08:47:07Z)
Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment [57.07360640784803]
We propose vision-language consistency guided multi-modal prompt learning for blind image quality assessment (AGIQA) Specifically, we introduce learnable textual and visual prompts in language and vision branches of Contrastive Language-Image Pre-training (CLIP) models. We design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts.
arXiv Detail & Related papers (2024-06-24T13:45:31Z)
DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild [54.139923409101044]
Blind image quality assessment (IQA) in the wild presents significant challenges. Given the difficulty in collecting large-scale training data, leveraging limited data to develop a model with strong generalization remains an open problem. Motivated by the robust image perception capabilities of pre-trained text-to-image (T2I) diffusion models, we propose a novel IQA method, diffusion priors-based IQA.
arXiv Detail & Related papers (2024-05-30T12:32:35Z)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension. First, the model self-constructs a preference for image descriptions using unlabeled images. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
Descriptive Image Quality Assessment in the Wild [25.503311093471076]
VLM-based Image Quality Assessment (IQA) seeks to describe image quality linguistically to align with human expression. We introduce Depicted image Quality Assessment in the Wild (DepictQA-Wild) Our method includes a multi-functional IQA task paradigm that encompasses both assessment and comparison tasks, brief and detailed responses, full-reference and non-reference scenarios.
arXiv Detail & Related papers (2024-05-29T07:49:15Z)
Large Multi-modality Model Assisted AI-Generated Image Quality Assessment [53.182136445844904]
We introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model. It uses semantically informed guidance to sense semantic information and extract semantic vectors through carefully designed text prompts. It achieves state-of-the-art performance, and demonstrates its superior generalization capabilities on assessing the quality of AI-generated images.
arXiv Detail & Related papers (2024-04-27T02:40:36Z)
Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness. Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings. This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
Learning Generalizable Perceptual Representations for Data-Efficient No-Reference Image Quality Assessment [7.291687946822539]
A major drawback of state-of-the-art NR-IQA techniques is their reliance on a large number of human annotations. We enable the learning of low-level quality features to distortion types by introducing a novel quality-aware contrastive loss. We design zero-shot quality predictions from both pathways in a completely blind setting.
arXiv Detail & Related papers (2023-12-08T05:24:21Z)
Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective [93.56647950778357]
Blind image quality assessment (BIQA) predicts the human perception of image quality without any reference information. We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks.
arXiv Detail & Related papers (2023-03-27T07:58:09Z)
Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering [17.51860125438028]
We propose to use a unimodal (text-only) train and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual question answering task which requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models of comparable number of parameters.
arXiv Detail & Related papers (2021-09-15T14:11:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.