Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English
- URL: http://arxiv.org/abs/2507.00700v1
- Date: Tue, 01 Jul 2025 11:56:45 GMT
- Title: Contrasting Cognitive Styles in Vision-Language Models: Holistic Attention in Japanese Versus Analytical Focus in English
- Authors: Ahmed Sabir, Azinovič Gasper, Mengsay Loem, Rajesh Sharma
- Abstract summary: We investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns.
Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.
- Score: 4.8310710966636545
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-cultural research in perception and cognition has shown that individuals from different cultural backgrounds process visual information in distinct ways. East Asians, for example, tend to adopt a holistic perspective, attending to contextual relationships, whereas Westerners often employ an analytical approach, focusing on individual objects and their attributes. In this study, we investigate whether Vision-Language Models (VLMs) trained predominantly on different languages, specifically Japanese and English, exhibit similar culturally grounded attentional patterns. Using comparative analysis of image descriptions, we examine whether these models reflect differences in holistic versus analytic tendencies. Our findings suggest that VLMs not only internalize the structural properties of language but also reproduce cultural behaviors embedded in the training data, indicating that cultural cognition may implicitly shape model outputs.
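The abstract describes the comparative analysis only at a high level. As a rough, hypothetical illustration of what a lexicon-based comparison of holistic versus analytic caption content could look like, consider the sketch below; the word lists, captions, and two-model setup are illustrative assumptions, not the authors' actual protocol.

```python
# Hypothetical sketch, not the authors' code: a lexicon-based comparison of
# holistic (context/relational) vs. analytic (object/attribute) content in
# captions produced by two differently trained VLMs. Word lists and captions
# are toy placeholders; a real study would need validated lexicons.
from collections import Counter

HOLISTIC_TERMS = {"background", "scene", "surrounded", "among", "behind", "atmosphere"}
ANALYTIC_TERMS = {"red", "large", "small", "round", "single", "one"}

def holistic_ratio(captions: list[str]) -> float:
    """Share of holistic terms among all style-marked tokens in the captions."""
    counts = Counter()
    for caption in captions:
        for token in caption.lower().split():
            if token in HOLISTIC_TERMS:
                counts["holistic"] += 1
            elif token in ANALYTIC_TERMS:
                counts["analytic"] += 1
    total = counts["holistic"] + counts["analytic"]
    return counts["holistic"] / total if total else 0.0

# Stand-ins for captions generated by a Japanese-trained and an
# English-trained VLM on the same set of images.
ja_vlm_captions = ["people resting among trees, a quiet scene in the background"]
en_vlm_captions = ["one large red kite and a small round dog"]

print(f"JA-trained VLM holistic ratio: {holistic_ratio(ja_vlm_captions):.2f}")
print(f"EN-trained VLM holistic ratio: {holistic_ratio(en_vlm_captions):.2f}")
```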
Related papers
- Cultural Awareness in Vision-Language Models: A Cross-Country Exploration [5.921976812527759]
Vision-Language Models (VLMs) are increasingly deployed in diverse cultural contexts.
We propose a novel framework to evaluate how VLMs encode cultural differences and biases related to race, gender, and physical traits across countries.
arXiv Detail & Related papers (2025-05-23T18:47:52Z)
- Linguistics-aware Masked Image Modeling for Self-supervised Scene Text Recognition [50.86415025650168]
Masked image modeling (MIM) tends to exploit local structures to reconstruct visual patterns, resulting in limited linguistic knowledge.
We propose a Linguistics-aware Masked Image Modeling (LMIM) approach, which channels the linguistic information into the decoding process of MIM through a separate branch.
arXiv Detail & Related papers (2025-03-24T14:53:35Z)
- Risks of Cultural Erasure in Large Language Models [4.613949381428196]
We argue for metricizable evaluations of language technologies that interrogate and account for historical power inequities.
We probe the representations that a language model produces about different places around the world when asked to describe these contexts.
We analyze the cultures represented in the travel recommendations produced by a set of language model applications.
arXiv Detail & Related papers (2025-01-02T04:57:50Z)
- KULTURE Bench: A Benchmark for Assessing Language Model in Korean Cultural Context [5.693660906643207]
We introduce KULTURE Bench, an evaluation framework specifically designed for Korean culture.
It is designed to assess language models' cultural comprehension and reasoning capabilities at the word, sentence, and paragraph levels.
The results show that there is still significant room for improvement in the models' understanding of texts related to the deeper aspects of Korean culture.
arXiv Detail & Related papers (2024-12-10T07:20:51Z)
- CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts [45.77570690529597]
We introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts.
Our evaluation of several state-of-the-art open Vision and Language models shows large performance disparities between culture-specific and common concepts.
Experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions.
arXiv Detail & Related papers (2024-10-20T17:31:19Z)
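As a minimal sketch of the kind of disparity measurement the CROPE entry describes, one could compare accuracy on culture-specific versus common concepts; the record fields and prediction interface below are assumptions, not CROPE's actual schema.

```python
# Minimal sketch of an accuracy-gap measurement between culture-specific and
# common concepts, in the spirit of the CROPE entry above. The record fields
# ("image", "question", "answer", "culture_specific") and the predict()
# interface are assumptions, not CROPE's actual schema.
def accuracy(examples: list[dict], predict) -> float:
    """Fraction of VQA examples answered correctly by predict(image, question)."""
    correct = sum(predict(ex["image"], ex["question"]) == ex["answer"] for ex in examples)
    return correct / len(examples) if examples else 0.0

def culture_gap(benchmark: list[dict], predict) -> float:
    """Accuracy on common concepts minus accuracy on culture-specific ones;
    a large positive gap indicates weaker culture-specific knowledge."""
    common = [ex for ex in benchmark if not ex["culture_specific"]]
    specific = [ex for ex in benchmark if ex["culture_specific"]]
    return accuracy(common, predict) - accuracy(specific, predict)
```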
- See It from My Perspective: How Language Affects Cultural Bias in Image Understanding [60.70852566256668]
Vision-language models (VLMs) can respond to queries about images in many languages.
We characterize the Western bias of VLMs in image understanding and investigate the role that language plays in this disparity.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
- Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
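A hedged sketch of the correlation analysis this entry suggests: embed the model's outputs for nationality-perturbed prompts, then correlate pairwise output similarity with the cultural proximity of the corresponding countries. The embedding inputs and value scores are stand-ins, not the paper's actual choices.

```python
# Hedged sketch of the correlation analysis described above. `outputs` maps a
# country to an embedding of the model's output for a prompt naming that
# nationality; `values` maps a country to a cultural-value score (e.g. one
# survey dimension). Both inputs are assumptions, not the paper's setup.
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_value_correlation(outputs: dict[str, np.ndarray],
                                 values: dict[str, float]) -> float:
    """Spearman correlation, over all country pairs, between output-text
    similarity and cultural-value proximity."""
    text_sims, value_prox = [], []
    for a, b in combinations(sorted(outputs), 2):
        text_sims.append(cosine(outputs[a], outputs[b]))
        value_prox.append(-abs(values[a] - values[b]))  # closer values => higher proximity
    return spearmanr(text_sims, value_prox).correlation
```

A weak correlation from such an analysis would mirror the entry's finding that output similarity tracks cultural values only loosely.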
- Semantic and Expressive Variation in Image Captions Across Languages [26.766596770616655]
We study how people from different cultural backgrounds observe vastly different concepts even when viewing the same visual stimuli.
By comparing textual descriptions generated across 7 languages for the same images, we find significant differences in the semantic content and linguistic expression.
Our work points towards the need to account for and embrace the diversity of human perception in the computer vision community.
arXiv Detail & Related papers (2023-10-22T16:51:42Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling approach.
arXiv Detail & Related papers (2021-10-13T05:30:50Z)
- Identifying Distributional Perspective Differences from Colingual Groups [41.58939666949895]
A lack of mutual understanding among different groups about their perspectives on specific values or events may lead to uninformed decisions or biased opinions.
We study colingual groups and use language corpora as a proxy to identify their distributional perspectives.
We present a novel computational approach to learn shared understandings, and benchmark our method by building culturally-aware models for the English, Chinese, and Japanese languages.
arXiv Detail & Related papers (2020-04-10T08:13:07Z)
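As an illustrative sketch (not the paper's method) of extracting distributional perspectives from colingual corpora, one could train a separate embedding space per group and compare a shared probe word's nearest neighbours across groups; gensim's Word2Vec is an assumed tooling choice.

```python
# Illustrative sketch, not the paper's method: read off a group's
# "distributional perspective" on a concept by training one embedding space
# per corpus and comparing the probe word's nearest neighbours across groups.
from gensim.models import Word2Vec  # assumed tooling choice

def perspective_neighbours(corpora: dict[str, list[list[str]]],
                           probe: str, topn: int = 5) -> dict[str, list[str]]:
    """Map each colingual group to the probe word's nearest neighbours in
    that group's own embedding space."""
    neighbours = {}
    for group, sentences in corpora.items():
        model = Word2Vec(sentences, vector_size=100, min_count=1, workers=1, seed=0)
        if probe in model.wv:
            neighbours[group] = [w for w, _ in model.wv.most_similar(probe, topn=topn)]
    return neighbours
```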
This list is automatically generated from the titles and abstracts of the papers on this site.