Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding
- URL: http://arxiv.org/abs/2406.10318v1
- Date: Fri, 14 Jun 2024 16:52:00 GMT
- Title: Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding
- Authors: Tuo Zhang, Tiantian Feng, Yibin Ni, Mengqin Cao, Ruying Liu, Katharine Butler, Yanjun Weng, Mi Zhang, Shrikanth S. Narayanan, Salman Avestimehr
- Abstract summary: We offer the Pun Rebus Art Dataset for art understanding deeply rooted in traditional Chinese culture.
We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explaining the conveyed messages.
Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations.
- Score: 28.490495656348187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explaining the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora.
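The abstract does not specify the dataset's schema, but the three tasks imply a natural per-artwork record. The sketch below is purely illustrative: the field names (`elements`, `meanings`, `message`) and the prompt wording are assumptions for exposition, not the dataset's actual format.

```python
# Hypothetical per-artwork record and three-task query loop for a pun
# rebus dataset. Field names and prompts are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class PunRebusRecord:
    image_path: str           # path to the artwork image
    elements: list[str]       # salient visual elements, e.g. ["bat", "peach"]
    meanings: dict[str, str]  # element -> symbolic meaning, e.g. {"bat": "good fortune"}
    message: str              # overall message the artwork conveys

def query_three_tasks(vlm, record: PunRebusRecord) -> dict[str, str]:
    """vlm is any callable (image_path, prompt) -> answer string."""
    return {
        "identify": vlm(record.image_path,
                        "List the salient visual elements in this artwork."),
        "match": vlm(record.image_path,
                     "What does each element symbolize: "
                     + ", ".join(record.elements) + "?"),
        "explain": vlm(record.image_path,
                       "Explain the message this artwork conveys."),
    }
```

The answers would then be scored against `elements`, `meanings`, and `message`; it is on exactly these comparisons that the paper reports biased and hallucinated VLM outputs.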
Related papers
- Diffusion Models Through a Global Lens: Are They Culturally Inclusive? [15.991121392458748]
We introduce the CultDiff benchmark, which evaluates state-of-the-art diffusion models.
We show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented country regions.
We develop a neural-based image-image similarity metric, CultDiff-S, to predict human judgments on real and generated images with cultural artifacts; a generic embedding-similarity sketch follows this entry.
arXiv Detail & Related papers (2025-02-13T03:05:42Z)
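The summary above does not give CultDiff-S's formulation. For orientation only, a common embedding-based baseline for image-image similarity is cosine similarity of CLIP features; CultDiff-S is trained to predict human judgments, so the sketch below should be read as a generic reference point, not the paper's metric.

```python
# Generic embedding-based image-image similarity baseline (NOT the
# CultDiff-S metric): cosine similarity of CLIP image features.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_similarity(path_a: str, path_b: str) -> float:
    images = [Image.open(p).convert("RGB") for p in (path_a, path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)    # shape (2, 512)
    feats = feats / feats.norm(dim=-1, keepdim=True)  # L2-normalize
    return float(feats[0] @ feats[1])                 # cosine similarity
```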
- CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements [1.0579965347526206]
Art, as a universal language, can be interpreted in diverse ways.
The advent of Large Language Models (LLMs) and the availability of Multimodal Large Language Models (MLLMs) raise the question of how these models can be used to assess and interpret artworks.
arXiv Detail & Related papers (2025-02-04T18:08:23Z)
- CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries [63.00147630084146]
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding.
CultureVerse is a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural categories, and 3 question types.
We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z)
- Understanding the World's Museums through Vision-Language Reasoning [49.976422699906706]
Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions.
We collect and curate a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world.
We train two VLMs from different categories: BLIP, which has vision-language aligned embeddings but lacks the expressive power of large language models, and LLaVA, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities.
arXiv Detail & Related papers (2024-12-02T10:54:31Z)
- See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding [78.88461026069862]
Vision-language models (VLMs) can respond to queries about images in many languages.
We present a novel investigation that demonstrates and localizes Western bias in image understanding.
arXiv Detail & Related papers (2024-06-17T15:49:51Z)
- How Culturally Aware are Vision-Language Models? [0.8437187555622164]
Images from folklore genres, such as mythology, folk dance, cultural signs, and symbols, are vital to every culture.
Our research compares the performance of four popular vision-language models in identifying culturally specific information in such images.
We propose a new evaluation metric, the Cultural Awareness Score (CAS), which measures the degree of cultural awareness in image captions; a hypothetical illustration of such a caption-level score follows this entry.
arXiv Detail & Related papers (2024-05-24T04:45:14Z)
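The CAS definition itself is not reproduced in this summary. Purely as an illustration of what a caption-level cultural awareness measure could look like, the sketch below scores a caption by its coverage of annotated culture-specific terms; this is an assumption for exposition, not the authors' formula.

```python
# Hypothetical caption-level score: fraction of annotated
# culture-specific terms mentioned in the caption. Illustrative only;
# not the paper's CAS definition.
def cultural_awareness_score(caption: str, cultural_terms: list[str]) -> float:
    if not cultural_terms:
        return 0.0
    text = caption.lower()
    hits = sum(term.lower() in text for term in cultural_terms)
    return hits / len(cultural_terms)

print(cultural_awareness_score(
    "A Bharatanatyam dancer in traditional costume performs on stage.",
    ["Bharatanatyam", "traditional costume", "Indian classical dance"],
))  # -> 0.666... (2 of 3 terms covered)
```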
- Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z)
- An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to perform this task.
We conduct a human evaluation of translated images to assess for cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z)
- Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages; a generic breadth-first sketch of this kind of expansion follows this entry.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
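The summary does not detail the navigation strategy; one generic way to realize "seed documents to linked pages" is a breadth-first traversal with a relevance filter. In the sketch below, `get_links` and `is_relevant` are assumed helper callables, and the whole procedure is a hypothetical stand-in for the paper's method.

```python
# Hypothetical breadth-first expansion from seed cultural pages through
# linked pages, keeping only pages that pass a relevance filter.
# get_links and is_relevant are assumed callables, not a real API.
from collections import deque

def expand_from_seeds(seeds, get_links, is_relevant, max_pages=1000):
    visited = set(seeds)
    queue = deque(seeds)
    collected = []
    while queue and len(collected) < max_pages:
        title = queue.popleft()
        collected.append(title)           # keep this page
        for link in get_links(title):     # follow outgoing links
            if link not in visited and is_relevant(link):
                visited.add(link)
                queue.append(link)
    return collected
```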
- CIC: A Framework for Culturally-Aware Image Captioning [2.565964707090901]
We propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions describing the cultural elements extracted from images representing cultures.
Inspired by methods that combine the visual modality with Large Language Models (LLMs), our framework generates questions based on cultural categories from images; a hypothetical pipeline sketch follows this entry.
A human evaluation with 45 participants from 4 cultural groups, each highly familiar with their corresponding culture, shows that our framework generates more culturally descriptive captions.
arXiv Detail & Related papers (2024-02-08T03:12:25Z)
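As a closing illustration, the question-then-caption flow described above can be sketched as a three-step pipeline. Everything here is hypothetical: the category list, prompt wording, and the `ask_vqa`/`compose_caption` callables are assumptions, not the CIC implementation.

```python
# Hypothetical sketch of a culturally-aware captioning flow:
# (1) form questions from cultural categories, (2) answer them with a
# VQA model, (3) have an LLM compose the final caption.
CULTURAL_CATEGORIES = ["architecture", "clothing", "food", "ritual"]

def culturally_aware_caption(image_path, ask_vqa, compose_caption):
    """ask_vqa(image_path, question) -> answer string;
    compose_caption(qa_pairs) -> caption string. Both are assumed."""
    questions = [
        f"What culturally specific {category} appears in this image?"
        for category in CULTURAL_CATEGORIES
    ]
    qa_pairs = {q: ask_vqa(image_path, q) for q in questions}
    return compose_caption(qa_pairs)
```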
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.