Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
- URL: http://arxiv.org/abs/2406.16469v2
- Date: Tue, 17 Dec 2024 05:51:01 GMT
- Title: Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
- Authors: Yujin Baek, ChaeHun Park, Jaeseok Kim, Yu-Jung Heo, Du-Seong Chang, Jaegul Choo,
- Abstract summary: We propose a semi-automated framework for constructing cultural VLM benchmarks, specifically targeting multiple-choice QA.
This framework combines human-VLM collaboration, where VLMs generate questions based on guidelines, a small set of annotated examples, and relevant knowledge, followed by a verification process by native speakers.
We demonstrate the effectiveness of this framework through the creation of K-Viscuit, a dataset focused on Korean culture.
- Score: 31.684544472009918
- License:
- Abstract: To create culturally inclusive vision-language models (VLMs), developing a benchmark that tests their ability to address culturally relevant questions is essential. Existing approaches typically rely on human annotators, making the process labor-intensive and creating a cognitive burden in generating diverse questions. To address this, we propose a semi-automated framework for constructing cultural VLM benchmarks, specifically targeting multiple-choice QA. This framework combines human-VLM collaboration, where VLMs generate questions based on guidelines, a small set of annotated examples, and relevant knowledge, followed by a verification process by native speakers. We demonstrate the effectiveness of this framework through the creation of K-Viscuit, a dataset focused on Korean culture. Our experiments on this dataset reveal that open-source models lag behind proprietary ones in understanding Korean culture, highlighting key areas for improvement. We also present a series of further analyses, including human evaluation, augmenting VLMs with external knowledge, and the evaluation beyond multiple-choice QA. Our dataset is available at https://huggingface.co/datasets/ddehun/k-viscuit.
Related papers
- CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries [63.00147630084146]
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding.
CultureVerse is a large-scale multimodal benchmark covering 19, 682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types.
We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z) - CROPE: Evaluating In-Context Adaptation of Vision and Language Models to Culture-Specific Concepts [45.77570690529597]
We introduce CROPE, a visual question answering benchmark designed to probe the knowledge of culture-specific concepts.
Our evaluation of several state-of-the-art open Vision and Language models shows large performance disparities between culture-specific and common concepts.
Experiments with contextual knowledge indicate that models struggle to effectively utilize multimodal information and bind culture-specific concepts to their depictions.
arXiv Detail & Related papers (2024-10-20T17:31:19Z) - Benchmarking Cognitive Domains for LLMs: Insights from Taiwanese Hakka Culture [4.467334566487944]
This study introduces a benchmark designed to evaluate the performance of large language models (LLMs) in understanding and processing cultural knowledge.
The study develops a multi-dimensional framework that systematically assesses LLMs across six cognitive domains: Remembering, Understanding, Applying, Analyzing, evaluating, and Creating.
The results highlight the effectiveness of RAG in improving accuracy across all cognitive domains, particularly in tasks requiring precise retrieval and application of cultural knowledge.
arXiv Detail & Related papers (2024-09-03T02:50:04Z) - CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision- Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z) - CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models [59.22460740026037]
"CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs)
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv Detail & Related papers (2024-05-22T20:19:10Z) - No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models [38.932610459192105]
We study cultural and socioeconomic diversity in contrastive vision-language models (VLMs)
Our work underscores the value of using diverse data to create more inclusive multimodal systems.
arXiv Detail & Related papers (2024-05-22T16:04:22Z) - CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge [69.82940934994333]
We introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build challenging evaluation dataset.
Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating cultural questions.
CULTURALBENCH-V0.1 is a compact yet high-quality evaluation dataset with users' red-teaming attempts.
arXiv Detail & Related papers (2024-04-10T00:25:09Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as textscLlama-2 and textscMistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.