HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment
- URL: http://arxiv.org/abs/2503.23907v1
- Date: Mon, 31 Mar 2025 09:58:11 GMT
- Title: HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment
- Authors: Zhichao Liao, Xiaokun Liu, Wenyu Qin, Qingyu Li, Qiulin Wang, Pengfei Wan, Di Zhang, Long Zeng, Pingfa Feng
- Abstract summary: HumanBeauty is the first dataset purpose-built for Human Image Aesthetic Assessment (HIAA). We propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. Our models deliver significantly better performance in HIAA than other state-of-the-art models.
- Score: 11.253286640424811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image Aesthetic Assessment (IAA) is a long-standing and challenging research task. However, its subset, Human Image Aesthetic Assessment (HIAA), has been scarcely explored, even though HIAA is widely used in social media, AI workflows, and related domains. To bridge this research gap, our work pioneers a holistic implementation framework tailored for HIAA. Specifically, we introduce HumanBeauty, the first dataset purpose-built for HIAA, which comprises 108K high-quality human images with manual annotations. To achieve comprehensive and fine-grained HIAA, 50K human images are manually collected through a rigorous curation process and annotated using our trailblazing 12-dimensional aesthetic standard, while the remaining 58K with overall aesthetic labels are systematically filtered from public datasets. Based on the HumanBeauty database, we propose HumanAesExpert, a powerful Vision Language Model for aesthetic evaluation of human images. We innovatively design an Expert head to incorporate human knowledge of aesthetic sub-dimensions while jointly utilizing the Language Modeling (LM) and Regression heads. This approach empowers our model to achieve superior proficiency in both overall and fine-grained HIAA. Furthermore, we introduce a MetaVoter, which aggregates scores from all three heads, to effectively balance the capabilities of each head, thereby realizing improved assessment precision. Extensive experiments demonstrate that our HumanAesExpert models deliver significantly better performance in HIAA than other state-of-the-art models. Our datasets, models, and code are publicly released to advance the HIAA community. Project webpage: https://humanaesexpert.github.io/HumanAesExpert/
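The abstract says a MetaVoter aggregates the scores of the LM, Regression, and Expert heads, but it does not say how. As a rough illustration only, the sketch below assumes the simplest possible aggregator: a weighted average whose weights come from a softmax over three logits. The function names (`softmax`, `meta_vote`), the logits, and the score scale are hypothetical and are not taken from the paper or its released code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn arbitrary logits into positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def meta_vote(
    lm_score: float,
    regression_score: float,
    expert_score: float,
    logits: Sequence[float] = (0.0, 0.0, 0.0),
) -> float:
    """Combine three per-head scores into one score.

    ASSUMPTION: a softmax-weighted average stands in for the paper's
    MetaVoter, whose actual form is not described in the abstract.
    With the default logits, every head gets equal weight.
    """
    weights = softmax(logits)
    scores = (lm_score, regression_score, expert_score)
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    # Equal weights: the result is the plain mean of the three scores.
    print(meta_vote(0.6, 0.7, 0.8))
    # A large first logit makes the result follow the LM head closely.
    print(meta_vote(0.6, 0.7, 0.8, logits=(10.0, 0.0, 0.0)))
```

In a trained system the logits would presumably be learned, or replaced by a small network over the head outputs; that choice is left open here because the abstract does not settle it.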
Related papers
- Human-Centric Foundation Models: Perception, Generation and Agentic Modeling [79.97999901785772]
Human-centric Foundation Models unify diverse human-centric tasks into a single framework.
We present a comprehensive overview of HcFMs by proposing a taxonomy that categorizes current approaches into four groups.
This survey aims to serve as a roadmap for researchers and practitioners working towards more robust, versatile, and intelligent digital human and embodiments modeling.
arXiv Detail & Related papers (2025-02-12T16:38:40Z)
- HumanVLM: Foundation for Human-Scene Vision-Language Model [3.583459930633303]
Human-scene vision-language tasks are increasingly prevalent in diverse social applications.
This study introduces a domain-specific Large Vision-Language Model, the Human-Scene Vision-Language Model (HumanVLM).
In experiments, we evaluate HumanVLM across various downstream tasks, where it demonstrates superior overall performance.
arXiv Detail & Related papers (2024-11-05T12:14:57Z)
- Sapiens: Foundation for Human Vision Models [14.72839332332364]
We present Sapiens, a family of models for four fundamental human-centric vision tasks.
Our models support 1K high-resolution inference and are easy to adapt for individual tasks.
We observe that self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks.
arXiv Detail & Related papers (2024-08-22T17:37:27Z)
- Are They the Same Picture? Adapting Concept Bottleneck Models for Human-AI Collaboration in Image Retrieval [3.2495565849970016]
CHAIR enables humans to correct intermediate concepts, which helps improve the generated embeddings.
We show that our method performs better than similar models on image retrieval metrics without any external intervention.
arXiv Detail & Related papers (2024-07-12T00:59:32Z)
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes vision models to better align them with human aesthetics.
arXiv Detail & Related papers (2024-06-13T17:59:20Z)
- HINT: Learning Complete Human Neural Representations from Limited Viewpoints [69.76947323932107]
We propose a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles.
As a result, our method can reconstruct complete humans even from a few viewing angles, increasing performance by more than 15% PSNR.
arXiv Detail & Related papers (2024-05-30T05:43:09Z)
- UniHuman: A Unified Model for Editing Human Images in the Wild [49.896715833075106]
We propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings.
To enhance the model's generation quality and generalization capacity, we leverage guidance from human visual encoders.
In user studies, UniHuman is preferred by the users in an average of 77% of cases.
arXiv Detail & Related papers (2023-12-22T05:00:30Z)
- Towards Artistic Image Aesthetics Assessment: a Large-scale Dataset and a New Method [64.40494830113286]
We first introduce a large-scale AIAA dataset: Boldbrush Artistic Image dataset (BAID), which consists of 60,337 artistic images covering various art forms.
We then propose a new method, SAAN, which can effectively extract and utilize style-specific and generic aesthetic information to evaluate artistic images.
Experiments demonstrate that our proposed approach outperforms existing IAA methods on the proposed BAID dataset.
arXiv Detail & Related papers (2023-03-27T12:59:15Z)
- VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining [53.470662123170555]
We propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations.
Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels.
Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset.
arXiv Detail & Related papers (2023-03-24T23:57:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.