Assessing the Aesthetic Evaluation Capabilities of GPT-4 with Vision:
Insights from Group and Individual Assessments
- URL: http://arxiv.org/abs/2403.03594v1
- Date: Wed, 6 Mar 2024 10:27:09 GMT
- Authors: Yoshia Abe, Tatsuya Daikoku, Yasuo Kuniyoshi
- Abstract summary: This study investigates the performance of GPT-4 with Vision on the task of aesthetic evaluation of images.
We employ two tasks: prediction of a group's average evaluation values and prediction of an individual's evaluation values.
Experimental results reveal GPT-4 with Vision's superior performance in predicting aesthetic evaluations and the nature of different responses to beauty and ugliness.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, it has been recognized that large language models demonstrate high
performance on various intellectual tasks. However, few studies have
investigated alignment with humans in behaviors that involve sensibility, such
as aesthetic evaluation. This study investigates the performance of GPT-4 with
Vision, a state-of-the-art language model that can handle image input, on the
task of aesthetic evaluation of images. We employ two tasks: prediction of a
group's average evaluation values and prediction of an individual's evaluation
values. We
investigate the performance of GPT-4 with Vision by exploring prompts and
analyzing prediction behaviors. Experimental results reveal GPT-4 with Vision's
superior performance in predicting aesthetic evaluations and the nature of
different responses to beauty and ugliness. Finally, we discuss developing an
AI system for aesthetic evaluation based on scientific knowledge of the human
perception of beauty, employing agent technologies that integrate traditional
deep learning models with large language models.
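As a rough illustration of the group-level prediction task described in the abstract, the sketch below asks GPT-4 with Vision for a single numeric aesthetic rating per image and correlates the model's ratings with mean human ratings. The prompt wording, the 1-7 scale, the model identifier, and the placeholder image paths and group averages are illustrative assumptions, not the authors' actual stimuli or protocol.

```python
# Minimal sketch, assuming the OpenAI Python SDK (>= 1.0). The prompt, rating scale,
# model name, and placeholder data are illustrative assumptions, not the paper's setup.
import base64
from statistics import correlation  # Pearson r, Python 3.10+

from openai import OpenAI

client = OpenAI()

def rate_image(path: str) -> float:
    """Ask GPT-4 with Vision for a 1-7 aesthetic rating of a single image."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Rate the aesthetic beauty of this image on a scale from 1 "
                         "(very ugly) to 7 (very beautiful). Reply with a single number."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=5,
    )
    return float(resp.choices[0].message.content.strip())

# Group-level task: correlate model ratings with the group's mean human ratings.
image_paths = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]  # hypothetical stimuli
human_means = [5.2, 2.8, 4.1]                                # hypothetical group averages
model_scores = [rate_image(p) for p in image_paths]
print("Pearson r:", correlation(model_scores, human_means))
```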
Related papers
- Good Idea or Not, Representation of LLM Could Tell [86.36317971482755]
We focus on idea assessment, which aims to leverage the knowledge of large language models to assess the merit of scientific ideas.
We release a benchmark dataset from nearly four thousand manuscript papers with full texts, meticulously designed to train and evaluate the performance of different approaches to this task.
Our findings suggest that the representations of large language models hold more potential in quantifying the value of ideas than their generative outputs.
arXiv Detail & Related papers (2024-09-07T02:07:22Z)
- Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language, Vision, Speech, and Multimodal Proficiency [3.161954199291541]
This research study comprehensively evaluates the language, vision, speech, and multimodal capabilities of GPT-4o.
GPT-4o demonstrates high accuracy and efficiency across multiple domains in language and reasoning capabilities.
The model shows variability and faces limitations in handling complex and ambiguous inputs.
arXiv Detail & Related papers (2024-06-19T19:00:21Z)
- Evaluating ChatGPT-4 Vision on Brazil's National Undergraduate Computer Science Exam [0.0]
This study investigates the performance of ChatGPT-4 Vision, OpenAI's most advanced visual model.
By presenting the model with the exam's open and multiple-choice questions in their original image format, we were able to evaluate the model's reasoning and self-reflecting capabilities.
ChatGPT-4 Vision significantly outperformed the average exam participant, placing it within the top 10% of scores.
arXiv Detail & Related papers (2024-06-14T02:42:30Z)
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms [91.19304518033144]
We aim to align vision models with human aesthetic standards in a retrieval system.
We propose a preference-based reinforcement learning method that fine-tunes vision models to better align them with human aesthetics.
arXiv Detail & Related papers (2024-06-13T17:59:20Z)
- Gemini vs GPT-4V: A Preliminary Comparison and Combination of Vision-Language Models Through Qualitative Cases [98.35348038111508]
This paper presents an in-depth comparative study of two pioneering models: Google's Gemini and OpenAI's GPT-4V(ision).
The core of our analysis delves into the distinct visual comprehension abilities of each model.
Our findings illuminate the unique strengths and niches of both models.
arXiv Detail & Related papers (2023-12-22T18:59:58Z)
- Grounded Intuition of GPT-Vision's Abilities with Scientific Images [44.44139684561664]
We formalize a process that many researchers have already been using instinctively to develop a "grounded intuition" of GPT-Vision.
We use our technique to examine alt text generation for scientific figures, finding that GPT-Vision is particularly sensitive to prompting.
Our method and analysis aim to help researchers ramp up their own grounded intuitions of new models while exposing how GPT-Vision can be applied to make information more accessible.
arXiv Detail & Related papers (2023-11-03T17:53:43Z)
- GPT-4V(ision) as a Generalist Evaluator for Vision-Language Tasks [70.98062518872999]
We validate GPT-4V's capabilities for evaluation purposes, addressing tasks ranging from foundational image-to-text and text-to-image synthesis to high-level image-to-image translation and multi-image-to-text alignment.
Notably, GPT-4V shows promising agreement with humans across various tasks and evaluation methods, demonstrating immense potential for multi-modal LLMs as evaluators.
arXiv Detail & Related papers (2023-11-02T16:11:09Z)
- A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
It is found that GPT-4V excels at understanding medical images and generates high-quality radiology reports.
However, its performance on medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments (a minimal zero-shot scoring sketch follows this list).
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
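For the zero-shot "look" assessment idea in the last entry above, a minimal sketch is given below. It scores an image by comparing it against an antonym prompt pair and taking the softmax weight of the positive prompt. The checkpoint, prompt pair, and scoring details are assumptions in the spirit of that paper, not its exact configuration.

```python
# Minimal sketch of zero-shot "look" (quality) scoring with CLIP via Hugging Face
# transformers. The checkpoint and antonym prompts are assumptions; the paper's
# exact prompts and setup may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_look_score(image_path: str) -> float:
    """Return a 0-1 quality score: softmax weight of the positive antonym prompt."""
    image = Image.open(image_path).convert("RGB")
    prompts = ["Good photo.", "Bad photo."]  # assumed antonym pair
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds temperature-scaled image-text cosine similarities.
    probs = outputs.logits_per_image.softmax(dim=-1)[0]
    return probs[0].item()

print(clip_look_score("example.jpg"))  # closer to 1.0 means higher predicted quality
```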