Perception of Visual Content: Differences Between Humans and Foundation Models
- URL: http://arxiv.org/abs/2411.18968v1
- Date: Thu, 28 Nov 2024 07:37:04 GMT
- Title: Perception of Visual Content: Differences Between Humans and Foundation Models
- Authors: Nardiena A. Pratama, Shaoyang Fan, Gianluca Demartini
- Abstract summary: This study compares human-generated and ML-generated annotations of images representing diverse socio-economic contexts. Our dataset comprises images of people from various geographical regions and income levels washing their hands.
- Score: 4.251488927334905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-annotated content is often used to train machine learning (ML) models. Recently, however, language and multi-modal foundation models have been used to replace and scale up human annotators' efforts. This study compares human-generated and ML-generated annotations of images representing diverse socio-economic contexts. We aim to understand differences in perception and identify potential biases in content interpretation. Our dataset comprises images of people from various geographical regions and income levels washing their hands. We compare human and ML-generated annotations semantically and evaluate their impact on predictive models. Our results show low similarity between human and machine annotations from a low-level perspective, i.e., in the types of words used and in sentence structure, but the two are alike in how similar or dissimilar they perceive images from different regions to be. Additionally, human annotations yielded the best overall and most balanced region classification performance at the class level, while ML Objects and ML Captions performed best for income regression. The similarity between humans and machines in their lack of bias when perceiving images highlights that they are more alike than initially assumed. The superior and fairer performance obtained by using human annotations for region classification and machine annotations for income regression shows how important the quality of the images and the discriminative features in the annotations are.
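As an illustration of the semantic comparison described in the abstract, the minimal sketch below embeds a human-written and an ML-generated caption for the same hand-washing image and measures their cosine similarity. The embedding model (all-MiniLM-L6-v2 via sentence-transformers) and the example captions are assumptions for illustration only; the paper does not specify this tooling.

```python
# Minimal sketch (not the authors' released code): embed one human caption and
# one ML caption for the same image and compare them with cosine similarity.
# The model choice and the captions below are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical annotations for one image of a person washing their hands.
human_caption = "A woman washes her hands under a tap in an outdoor courtyard."
ml_caption = "A person standing next to a bucket of water."

model = SentenceTransformer("all-MiniLM-L6-v2")
human_vec, ml_vec = model.encode([human_caption, ml_caption])

# A low score here would mirror the paper's finding of low low-level
# similarity between human and machine annotations.
score = cosine_similarity([human_vec], [ml_vec])[0, 0]
print(f"Human vs. ML caption similarity: {score:.3f}")
```

The same caption embeddings (or bag-of-words features over them) could, in principle, feed the downstream region classifier and income regressor the abstract evaluates, though the paper's exact feature pipeline is not given here.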
Related papers
- MLLMs are Deeply Affected by Modality Bias [158.64371871084478]
Recent advances in Multimodal Large Language Models (MLLMs) have shown promising results in integrating diverse modalities such as texts and images. MLLMs are heavily influenced by modality bias, often relying on language while under-utilizing other modalities like visual inputs. This paper argues that MLLMs are deeply affected by modality bias, highlighting its manifestations across various tasks.
arXiv Detail & Related papers (2025-05-24T11:49:31Z) - Detecting Linguistic Indicators for Stereotype Assessment with Large Language Models [0.9285295512807729]
Social categories and stereotypes are embedded in language and can introduce data bias into Large Language Models.
We propose a new approach that detects and quantifies the linguistic indicators of stereotypes in a sentence.
arXiv Detail & Related papers (2025-02-26T14:15:28Z) - Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities [31.293869275511412]
This paper thoroughly revisits the Multimodal Large Language Models (MLLMs) with an in-depth analysis of image classification.
Our findings reveal that the most recent MLLMs can match or even outperform CLIP-style vision-language models on several datasets.
arXiv Detail & Related papers (2024-12-21T00:46:56Z) - Verbalized Representation Learning for Interpretable Few-Shot Generalization [130.8173035901391]
Verbalized Representation Learning (VRL) is a novel approach for automatically extracting human-interpretable features for object recognition.
Our method captures inter-class differences and intra-class commonalities in the form of natural language.
VRL achieves a 24% absolute improvement over prior state-of-the-art methods.
arXiv Detail & Related papers (2024-11-27T01:55:08Z) - When Does Perceptual Alignment Benefit Vision Representations? [76.32336818860965]
We investigate how aligning vision model representations to human perceptual judgments impacts their usability.
We find that aligning models to perceptual judgments yields representations that improve upon the original backbones across many downstream tasks.
Our results suggest that injecting an inductive bias about human perceptual knowledge into vision models can contribute to better representations.
arXiv Detail & Related papers (2024-10-14T17:59:58Z) - HLB: Benchmarking LLMs' Humanlikeness in Language Use [2.438748974410787]
We present a comprehensive humanlikeness benchmark (HLB) evaluating 20 large language models (LLMs).
We collected responses from over 2,000 human participants and compared them to outputs from the LLMs in these experiments.
Our results reveal fine-grained differences in how well LLMs replicate human responses across various linguistic levels.
arXiv Detail & Related papers (2024-09-24T09:02:28Z) - Whose Preferences? Differences in Fairness Preferences and Their Impact on the Fairness of AI Utilizing Human Feedback [8.04095222893591]
We find significant gaps in fairness preferences depending on the race, age, political stance, educational level, and LGBTQ+ identity of annotators.
We also demonstrate that demographics mentioned in text have a strong influence on how users perceive individual fairness in moderation.
arXiv Detail & Related papers (2024-06-09T19:42:25Z) - Evaluating Vision-Language Models on Bistable Images [34.492117496933915]
This study is the most extensive examination of vision-language models using bistable images to date.
We manually gathered a dataset of 29 bistable images, along with their associated labels, and subjected them to 116 different manipulations in brightness, tint, and rotation.
Our findings reveal that, with the exception of models from the Idefics family and LLaVA1.5-13b, there is a pronounced preference for one interpretation over another.
arXiv Detail & Related papers (2024-05-29T18:04:59Z) - Naming, Describing, and Quantifying Visual Objects in Humans and LLMs [5.59181673439492]
We evaluate Vision & Language Large Language Models (VLLMs) on three categories (nouns, attributes, and quantifiers).
We find mixed evidence on the ability of VLLMs to capture human naming preferences at generation time.
arXiv Detail & Related papers (2024-03-11T17:20:12Z) - CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows that CoAnnotating is an effective means of allocating work, with results on different datasets showing up to a 21% performance improvement over a random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z) - Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z) - Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations and explanations using large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
arXiv Detail & Related papers (2023-05-24T06:19:14Z) - Auditing Gender Presentation Differences in Text-to-Image Models [54.16959473093973]
We study how gender is presented differently in text-to-image models.
By probing gender indicators in the input text, we quantify the frequency differences of presentation-centric attributes.
We propose an automatic method to estimate such differences.
arXiv Detail & Related papers (2023-02-07T18:52:22Z) - On the Compositional Generalization Gap of In-Context Learning [73.09193595292233]
We look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning.
We evaluate four model families (OPT, BLOOM, CodeGen, and Codex) on three semantic parsing datasets.
arXiv Detail & Related papers (2022-11-15T19:56:37Z) - VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z) - The Need for Interpretable Features: Motivation and Taxonomy [69.07189753428553]
We claim that the term "interpretable feature" is not specific nor detailed enough to capture the full extent to which features impact the usefulness of machine learning explanations.
In this paper, we motivate and discuss three key lessons: 1) more attention should be given to what we refer to as the interpretable feature space, or the state of features that are useful to domain experts taking real-world actions.
arXiv Detail & Related papers (2022-02-23T19:19:14Z) - Exploring Alignment of Representations with Human Perception [47.53970721813083]
We show that inputs that are mapped to similar representations by the model should be perceived similarly by humans.
Our approach yields a measure of the extent to which a model is aligned with human perception.
We find that various properties of a model like its architecture, training paradigm, training loss, and data augmentation play a significant role in learning representations that are aligned with human perception.
arXiv Detail & Related papers (2021-11-29T17:26:50Z)