CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
- URL: http://arxiv.org/abs/2407.01081v1
- Date: Mon, 1 Jul 2024 08:35:37 GMT
- Title: CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
- Authors: Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che,
- Abstract summary: We present a new Chinese Vision-Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
- Score: 49.41531871253317
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in Chinese culture. To remedy this issue, we present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark contains four distinct VL tasks ranging from image-text retrieval to visual question answering, visual grounding and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal their performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
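Benchmarks like CVLUE typically report image-text retrieval with Recall@K: the fraction of queries whose matching item appears in the top K retrieved candidates. As a minimal sketch (not CVLUE's actual evaluation code), Recall@K can be computed from a caption-by-image similarity matrix like so:

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Text-to-image Recall@K from a similarity matrix.

    sim[i, j] is the similarity between caption i and image j;
    the ground-truth image for caption i is assumed to be image i.
    """
    # Rank images for each caption by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    # Check whether the ground-truth index appears in the top-k ranks.
    hits = ranks[:, :k] == np.arange(sim.shape[0])[:, None]
    return float(hits.any(axis=1).mean())

# Toy similarity matrix for 3 caption-image pairs.
sim = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],  # correct image only ranked 2nd
    [0.1, 0.2, 0.7],
])
print(recall_at_k(sim, 1))  # 2 of 3 captions hit at K=1
print(recall_at_k(sim, 2))  # all 3 hit at K=2
```

The same matrix transposed gives image-to-text recall; real benchmarks average over multiple captions per image, which this toy setup omits.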
Related papers
- Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration [31.684544472009918]
We propose a semi-grained pipeline for constructing cultural VLM benchmarks.
VLM models generate questions based on guidelines, human-annotated examples, and image-wise relevant knowledge.
This pipeline is demonstrated through a specific application: creating a dataset tailored to Korean culture, dubbed K-Viscuit.
arXiv Detail & Related papers (2024-06-24T09:18:15Z) - See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding [78.88461026069862]
Vision-language models (VLMs) can respond to queries about images in many languages.
We present a novel investigation that demonstrates and localizes Western bias in image understanding.
arXiv Detail & Related papers (2024-06-17T15:49:51Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.40505206535077]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 28 countries on four continents, covering 26 languages with 11 scripts, providing a total of 9k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z) - What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models [25.088717058818528]
We introduce nine vision-and-language (VL) tasks and construct multilingual visual-text datasets in four languages: English, Japanese, Swahili, and Urdu.
Our work is the first to conduct such analyses in Swahili and Urdu. It also introduces rationales in VL analysis, which played a vital role in the evaluation.
arXiv Detail & Related papers (2024-03-29T10:53:07Z) - COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning [57.600941792026006]
We introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset.
Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions.
We train models of various scales on different subsets of CQIA and conduct in-depth evaluations and analyses.
arXiv Detail & Related papers (2024-03-26T19:24:18Z) - ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding [1.9906814758497542]
ICU, Image Caption Understanding, divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM) takes the caption as the alt text and performs cross-lingual language understanding.
We show that ICU can achieve new state-of-the-art results for five languages, and comparable results for the rest.
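The two-stage recipe described above can be sketched as a simple pipeline. The callables below are hypothetical stand-ins, not the paper's actual models: any English captioning model and any multilingual language model could fill the two roles.

```python
from typing import Callable

def icu_pipeline(
    image: object,
    question: str,
    caption_model: Callable[[object], str],  # hypothetical English captioner
    mlm: Callable[[str], str],               # hypothetical multilingual LM
) -> str:
    """Illustrative sketch of an ICU-style two-stage inference.

    Stage 1: a V&L model captions the image in English.
    Stage 2: a multilingual LM treats the caption as alt text and
    performs the language-understanding step in the target language.
    """
    caption = caption_model(image)
    prompt = f"Image description: {caption}\nQuestion: {question}\nAnswer:"
    return mlm(prompt)
```

The design point is that only the (language-agnostic) captioning stage needs vision, so cross-lingual coverage is inherited from the text-only multilingual model rather than requiring multilingual V&L training data.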
arXiv Detail & Related papers (2023-10-19T07:11:48Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.