CIC: A framework for Culturally-aware Image Captioning
- URL: http://arxiv.org/abs/2402.05374v3
- Date: Mon, 19 Aug 2024 00:52:51 GMT
- Title: CIC: A framework for Culturally-aware Image Captioning
- Authors: Youngsik Yun, Jihie Kim
- Abstract summary: We propose a new framework, Culturally-aware Image Captioning (CIC), which generates captions that describe the cultural elements extracted from cultural visual elements in images representing cultures.
Inspired by methods combining visual modality and Large Language Models (LLMs), our framework generates questions based on cultural categories from images.
Our human evaluation, conducted with 45 participants from 4 different cultural groups who have a strong understanding of their respective cultures, shows that our proposed framework generates more culturally descriptive captions.
- Score: 2.565964707090901
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image Captioning, which generates descriptive sentences from images using Vision-Language Pre-trained models (VLPs) such as BLIP, has improved greatly. However, current methods lack the generation of detailed descriptive captions for the cultural elements depicted in the images, such as the traditional clothing worn by people from Asian cultural groups. In this paper, we propose a new framework, Culturally-aware Image Captioning (CIC), which generates captions that describe the cultural elements extracted from cultural visual elements in images representing cultures. Inspired by methods combining visual modality and Large Language Models (LLMs) through appropriate prompts, our framework (1) generates questions based on cultural categories from images, (2) extracts cultural visual elements from Visual Question Answering (VQA) using the generated questions, and (3) generates culturally-aware captions using LLMs with the prompts. Our human evaluation, conducted with 45 participants from 4 different cultural groups who have a strong understanding of their respective cultures, shows that our proposed framework generates more culturally descriptive captions when compared to an image captioning baseline based on VLPs. Resources can be found at https://shane3606.github.io/cic.
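The three stages described in the abstract can be sketched as a simple pipeline. The category list, question template, and stub model functions below are illustrative assumptions for exposition only; the paper's actual prompts, categories, and models (e.g. BLIP for VQA, an LLM for caption generation) may differ.

```python
# Hypothetical sketch of the three-stage CIC pipeline from the abstract.
# All names and templates here are assumptions, not the authors' code.

CULTURAL_CATEGORIES = ["clothing", "food", "architecture", "ritual"]  # assumed

def generate_questions(categories):
    # Stage 1: derive one VQA question per cultural category.
    return [f"What {c} in this image is characteristic of a culture?"
            for c in categories]

def extract_cultural_elements(vqa_model, image, questions):
    # Stage 2: query a VQA model for each question, keeping non-empty answers.
    answers = (vqa_model(image, q) for q in questions)
    return [a for a in answers if a]

def build_caption_prompt(base_caption, elements):
    # Stage 3: assemble an LLM prompt that injects the extracted cultural
    # elements into the baseline VLP caption.
    return (f"Rewrite the caption '{base_caption}' so that it explicitly "
            f"describes these cultural elements: {', '.join(elements)}.")

# Toy stand-in for a pretrained VQA model, used only to make the sketch run.
def toy_vqa(image, question):
    return "hanbok" if "clothing" in question else ""

questions = generate_questions(CULTURAL_CATEGORIES)
elements = extract_cultural_elements(toy_vqa, "image.jpg", questions)
prompt = build_caption_prompt("A woman standing in a palace courtyard", elements)
print(prompt)
```

In the real framework, `toy_vqa` would be replaced by an actual VQA model and the resulting prompt sent to an LLM to produce the final culturally-aware caption.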
Related papers
- Benchmarking Vision Language Models for Cultural Understanding [31.898921287065242]
This study introduces CulturalVQA, a visual question-answering benchmark aimed at assessing the cultural understanding of Vision Language Models (VLMs).
We curate a collection of 2,378 image-question pairs with 1-5 answers per question representing cultures from 11 countries across 5 continents.
The questions probe understanding of various facets of culture such as clothing, food, drinks, rituals, and traditions.
arXiv Detail & Related papers (2024-07-15T17:21:41Z)
- See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding [78.88461026069862]
Vision-language models (VLMs) can respond to queries about images in many languages.
We present a novel investigation that demonstrates and localizes Western bias in image understanding.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
- Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding [28.490495656348187]
We offer the Pun Rebus Art dataset for art understanding deeply rooted in traditional Chinese culture.
We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explaining the conveyed messages.
Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations.
arXiv Detail & Related papers (2024-06-14T16:52:00Z)
- How Culturally Aware are Vision-Language Models? [0.8437187555622164]
Images from folklore genres, such as mythology, folk dance, cultural signs, and symbols, are vital to every culture.
Our research compares the performance of four popular vision-language models in identifying culturally specific information in such images.
We propose a new evaluation metric, Cultural Awareness Score (CAS), dedicated to measuring the degree of cultural awareness in image captions.
arXiv Detail & Related papers (2024-05-24T04:45:14Z)
- CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
- CultureBank: An Online Community-Driven Knowledge Base Towards Culturally Aware Language Technologies [53.2331634010413]
CultureBank is a knowledge base built upon users' self-narratives.
It contains 12K cultural descriptors sourced from TikTok and 11K from Reddit.
We offer recommendations for future culturally aware language technologies.
arXiv Detail & Related papers (2024-04-23T17:16:08Z)
- CULTURE-GEN: Revealing Global Cultural Perception in Language Models through Natural Language Prompting [73.94059188347582]
We uncover the cultural perceptions of three SOTA models for 110 countries and regions across 8 culture-related topics through culture-conditioned generations.
We discover that culture-conditioned generations contain linguistic "markers" that distinguish marginalized cultures from default cultures.
arXiv Detail & Related papers (2024-04-16T00:50:43Z)
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to do the task.
We conduct a human evaluation of translated images to assess for cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z)
- Navigating Cultural Chasms: Exploring and Unlocking the Cultural POV of Text-To-Image Models [32.99865895211158]
We explore the cultural perception embedded in Text-To-Image (TTI) models by characterizing culture across three tiers.
We propose a comprehensive suite of evaluation techniques, including intrinsic evaluations using the CLIP space.
To bolster our research, we introduce the CulText2I dataset, derived from six diverse TTI models and spanning ten languages.
arXiv Detail & Related papers (2023-10-03T10:13:36Z)
- On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data.
There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images.
We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z)
- Culture-to-Culture Image Translation and User Evaluation [0.0]
The article introduces the concept of image "culturization," which we define as the process of altering the brushstroke of cultural features.
We defined a pipeline for translating objects' images from a source to a target cultural domain based on state-of-the-art Generative Adversarial Networks.
We gathered data through an online questionnaire to test four hypotheses concerning the impact of images belonging to different cultural domains on Italian participants.
arXiv Detail & Related papers (2022-01-05T12:10:42Z)