Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding
- URL: http://arxiv.org/abs/2406.10318v1
- Date: Fri, 14 Jun 2024 16:52:00 GMT
- Title: Creating a Lens of Chinese Culture: A Multimodal Dataset for Chinese Pun Rebus Art Understanding
- Authors: Tuo Zhang, Tiantian Feng, Yibin Ni, Mengqin Cao, Ruying Liu, Katharine Butler, Yanjun Weng, Mi Zhang, Shrikanth S. Narayanan, Salman Avestimehr,
- Abstract summary: We offer the Pun Rebus Art dataset for art understanding deeply rooted in traditional Chinese culture.
We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explanations for the conveyed messages.
Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations.
- Score: 28.490495656348187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision-language models (VLMs) have demonstrated remarkable abilities in understanding everyday content. However, their performance in the domain of art, particularly culturally rich art forms, remains less explored. As a pearl of human wisdom and creativity, art encapsulates complex cultural narratives and symbolism. In this paper, we offer the Pun Rebus Art Dataset, a multimodal dataset for art understanding deeply rooted in traditional Chinese culture. We focus on three primary tasks: identifying salient visual elements, matching elements with their symbolic meanings, and explanations for the conveyed messages. Our evaluation reveals that state-of-the-art VLMs struggle with these tasks, often providing biased and hallucinated explanations and showing limited improvement through in-context learning. By releasing the Pun Rebus Art Dataset, we aim to facilitate the development of VLMs that can better understand and interpret culturally specific content, promoting greater inclusiveness beyond English-based corpora.
Related papers
- CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts [45.77570690529597]
We introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts.
Our evaluation of several state-of-the-art open Vision and Language models shows large performance disparities between culture-specific and common concepts.
Experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions.
arXiv Detail & Related papers (2024-10-20T17:31:19Z) - LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models [60.67899965748755]
We present LLaVA-Read, a multimodal large language model that utilizes dual visual encoders along with a visual text encoder.
Our research suggests visual text understanding remains an open challenge and an efficient visual text encoder is crucial for future successful multimodal systems.
arXiv Detail & Related papers (2024-07-27T05:53:37Z) - See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding [78.88461026069862]
Vision-language models (VLMs) can respond to queries about images in many languages.
We present a novel investigation that demonstrates and localizes Western bias in image understanding.
arXiv Detail & Related papers (2024-06-17T15:49:51Z) - How Culturally Aware are Vision-Language Models? [0.8437187555622164]
Images from folklore genres, such as mythology, folk dance, cultural signs, and symbols, are vital to every culture.
Our research compares the performance of four popular vision-language models in identifying culturally specific information in such images.
We propose a new evaluation metric, Cultural Awareness Score (CAS), dedicated to measuring the degree of cultural awareness in image captions.
arXiv Detail & Related papers (2024-05-24T04:45:14Z) - Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense [98.09670425244462]
Large language models (LLMs) have demonstrated substantial commonsense understanding.
This paper examines the capabilities and limitations of several state-of-the-art LLMs in the context of cultural commonsense tasks.
arXiv Detail & Related papers (2024-05-07T20:28:34Z) - An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance [53.974497865647336]
We take a first step towards translating images to make them culturally relevant.
We build three pipelines comprising state-of-the-art generative models to do the task.
We conduct a human evaluation of translated images to assess for cultural relevance and meaning preservation.
arXiv Detail & Related papers (2024-04-01T17:08:50Z) - Artwork Explanation in Large-scale Vision Language Models [32.645448509968226]
Large-scale vision-language models (LVLMs) output text from images and instructions, demonstrating advanced capabilities in text generation and comprehension.
We propose a new task: the artwork explanation generation task, along with its evaluation dataset and metric.
It consists of two parts: generating explanations from both images and titles of artworks, and generating explanations using only images.
Our findings indicate that LVLMs not only struggle with integrating language and visual information but also exhibit a more pronounced limitation in acquiring knowledge from images alone.
arXiv Detail & Related papers (2024-02-29T19:01:03Z) - Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z) - CIC: A framework for Culturally-aware Image Captioning [2.565964707090901]
We propose a new framework, Culturally-aware Image Captioning (CIC), that generates captions and describes cultural elements extracted from cultural visual elements in images representing cultures.
Inspired by methods combining visual modality and Large Language Models (LLMs), our framework generates questions based on cultural categories from images.
Our human evaluation conducted on 45 participants from 4 different cultural groups with a high understanding of the corresponding culture shows that our proposed framework generates more culturally descriptive captions.
arXiv Detail & Related papers (2024-02-08T03:12:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.