Computer Vision Datasets and Models Exhibit Cultural and Linguistic
Diversity in Perception
- URL: http://arxiv.org/abs/2310.14356v3
- Date: Sat, 9 Mar 2024 20:47:30 GMT
- Title: Computer Vision Datasets and Models Exhibit Cultural and Linguistic
Diversity in Perception
- Authors: Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna
- Abstract summary: We study how people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli.
By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression.
Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.
- Score: 28.716435050743957
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer vision often treats human perception as homogeneous: an implicit
assumption that visual stimuli are perceived similarly by everyone. This
assumption is reflected in the way researchers collect datasets and train
vision models. By contrast, literature in cross-cultural psychology and
linguistics has provided evidence that people from different cultural
backgrounds observe vastly different concepts even when viewing the same visual
stimuli. In this paper, we study how these differences manifest themselves in
vision-language datasets and models, using language as a proxy for culture. By
comparing textual descriptions generated across 7 languages for the same
images, we find significant differences in the semantic content and linguistic
expression. When datasets are multilingual as opposed to monolingual,
descriptions have higher semantic coverage on average, where coverage is
measured using scene graphs, model embeddings, and linguistic taxonomies. For
example, multilingual descriptions have on average 29.9% more objects, 24.5%
more relations, and 46.0% more attributes than a set of monolingual captions.
When prompted to describe images in different languages, popular models (e.g.
LLaVA) inherit this bias and describe different parts of the image. Moreover,
finetuning models on captions from one language performs best on corresponding
test data from that language, while finetuning on multilingual data performs
consistently well across all test data compositions. Our work points towards
the need to account for and embrace the diversity of human perception in the
computer vision community.
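To make the coverage comparison above concrete, here is a minimal sketch of scene-graph-based coverage: it pools the objects, relations, and attributes mentioned in captions written in several languages and reports the relative gain over a single-language caption set. The SceneGraph container, the data layout, and the example captions are hypothetical illustrations, not the authors' pipeline; the embedding- and taxonomy-based coverage measures from the abstract are not reproduced here.

```python
# Minimal sketch of scene-graph-based coverage, assuming scene graphs have
# already been extracted from each caption by some parser (not shown here).
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    objects: set = field(default_factory=set)     # e.g. {"dog", "ball"}
    relations: set = field(default_factory=set)   # e.g. {("dog", "chasing", "ball")}
    attributes: set = field(default_factory=set)  # e.g. {("ball", "red")}

def pooled(graphs):
    """Union of objects, relations, and attributes across a set of captions."""
    objs, rels, attrs = set(), set(), set()
    for g in graphs:
        objs |= g.objects
        rels |= g.relations
        attrs |= g.attributes
    return {"objects": objs, "relations": rels, "attributes": attrs}

def relative_gain(multilingual_graphs, monolingual_graphs):
    """Percentage increase in coverage from pooling captions across languages."""
    multi, mono = pooled(multilingual_graphs), pooled(monolingual_graphs)
    return {k: 100.0 * (len(multi[k]) - len(mono[k])) / max(len(mono[k]), 1)
            for k in multi}

# Hypothetical example: English + Japanese captions of one image vs. English only.
en = SceneGraph({"dog", "ball"}, {("dog", "chasing", "ball")}, {("ball", "red")})
ja = SceneGraph({"dog", "park", "grass"}, set(), {("grass", "green")})
print(relative_gain([en, ja], [en]))
# {'objects': 100.0, 'relations': 0.0, 'attributes': 100.0}
```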
Related papers
- Cross-Lingual and Cross-Cultural Variation in Image Descriptions [2.8664758928324883]
We conduct the first large-scale empirical study of cross-lingual variation in image descriptions.
We use a multimodal dataset with 31 languages and images from diverse locations.
Our analysis reveals that pairs of languages that are geographically or genetically closer tend to mention the same entities more frequently.
arXiv Detail & Related papers (2024-09-25T05:57:09Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- Multilingual Diversity Improves Vision-Language Representations [66.41030381363244]
Pre-training on this multilingual dataset outperforms using English-only or English-dominated datasets on ImageNet.
On a geographically diverse task like GeoDE, we also observe improvements across all regions, with the biggest gain coming from Africa.
arXiv Detail & Related papers (2024-05-27T08:08:51Z)
- Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity [64.18762301574954]
Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings.
This seems to be true for both monolingual and multilingual models, although much less work has been done in the multilingual setting.
We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models.
arXiv Detail & Related papers (2023-06-01T09:01:48Z)
- Multi-lingual and Multi-cultural Figurative Language Understanding [69.47641938200817]
Figurative language permeates human communication, but is relatively understudied in NLP.
We create a dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba.
Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region.
Models perform significantly worse in all of these languages than in English, with variations in performance reflecting the availability of pre-training and fine-tuning data.
arXiv Detail & Related papers (2023-05-25T15:30:31Z)
- Comparing Biases and the Impact of Multilingual Training across Multiple Languages [70.84047257764405]
We present a bias analysis across Italian, Chinese, English, Hebrew, and Spanish on the downstream sentiment analysis task.
We adapt existing sentiment bias templates in English to Italian, Chinese, Hebrew, and Spanish for four attributes: race, religion, nationality, and gender.
Our results reveal similarities in bias expression, such as favoritism toward groups that are dominant in each language's culture.
arXiv Detail & Related papers (2023-05-18T18:15:07Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Deception detection in text and its relation to the cultural dimension of individualism/collectivism [6.17866386107486]
We investigate whether differences in the usage of specific linguistic features of deception across cultures can be confirmed and attributed to norms with respect to the individualism/collectivism divide.
We create culture/language-aware classifiers by experimenting with a wide range of n-gram features based on phonology, morphology and syntax.
We conducted our experiments on 11 datasets in 5 languages (English, Dutch, Russian, Spanish, and Romanian) from six countries (US, Belgium, India, Russia, Mexico, and Romania).
arXiv Detail & Related papers (2021-05-26T13:09:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.