CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
- URL: http://arxiv.org/abs/2407.01081v1
- Date: Mon, 1 Jul 2024 08:35:37 GMT
- Title: CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation
- Authors: Yuxuan Wang, Yijun Liu, Fei Yu, Chen Huang, Kexin Li, Zhiguo Wan, Wanxiang Che
- Abstract summary: We present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
- Score: 49.41531871253317
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite the rapid development of Chinese vision-language models (VLMs), most existing Chinese vision-language (VL) datasets are constructed on Western-centric images from existing English VL datasets. The cultural bias in the images makes these datasets unsuitable for evaluating VLMs in Chinese culture. To remedy this issue, we present a new Chinese Vision-Language Understanding Evaluation (CVLUE) benchmark dataset, where the selection of object categories and images is entirely driven by Chinese native speakers, ensuring that the source images are representative of Chinese culture. The benchmark contains four distinct VL tasks ranging from image-text retrieval to visual question answering, visual grounding and visual dialogue. We present a detailed statistical analysis of CVLUE and provide a baseline performance analysis with several open-source multilingual VLMs on CVLUE and its English counterparts to reveal their performance gap between English and Chinese. Our in-depth category-level analysis reveals a lack of Chinese cultural knowledge in existing VLMs. We also find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
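For context on how the retrieval track of such a benchmark is typically scored, the sketch below computes Recall@K from a text-image similarity matrix. It is a minimal illustration, not the CVLUE evaluation code: the embeddings are random toy data, and a real run would use a multilingual VLM's image and text encoders.

```python
# Minimal sketch of image-text retrieval scoring (Recall@K), the kind of metric
# a retrieval task like CVLUE's would report. NOT the authors' code; the
# "embeddings" below are random toy data standing in for a real VLM's outputs.
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """similarity[i, j] = score between query i and candidate j.
    Ground truth: query i matches candidate i (the diagonal)."""
    ranks = (-similarity).argsort(axis=1)  # candidates sorted by score, best first
    hits = (ranks[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 4 texts vs. 4 images with roughly aligned placeholder embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 512))
image_emb = text_emb + 0.1 * rng.normal(size=(4, 512))
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

sim = text_emb @ image_emb.T  # cosine similarity, text-to-image retrieval
print("R@1:", recall_at_k(sim, 1), "R@4:", recall_at_k(sim, 4))
```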
Related papers
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English.
We introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding.
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z) - The Impact of Visual Information in Chinese Characters: Evaluating Large Models' Ability to Recognize and Utilize Radicals [17.24821720084663]
We evaluate Large Language Models' and Vision-Language Models' understanding of visual elements in Chinese characters.
Our results reveal that models surprisingly exhibit some, but still limited, knowledge of the visual information.
We observe consistent improvement in Part-Of-Speech tagging when providing additional information about radicals.
arXiv Detail & Related papers (2024-10-11T17:30:02Z) - Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration [31.684544472009918]
We propose a semi-automated pipeline for constructing cultural VLM benchmarks.
VLMs generate questions based on guidelines, human-annotated examples, and image-wise relevant knowledge.
This pipeline is demonstrated through a specific application: creating a dataset tailored to Korean culture, dubbed K-Viscuit.
arXiv Detail & Related papers (2024-06-24T09:18:15Z) - See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding [78.88461026069862]
Vision-language models (VLMs) can respond to queries about images in many languages.
We present a novel investigation that demonstrates and localizes Western bias in image understanding.
arXiv Detail & Related papers (2024-06-17T15:49:51Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z) - What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases [87.65903426052155]
We perform a large-scale transfer learning experiment aimed at discovering latent vision-language skills from data.
We show that generation tasks suffer from a length bias, suggesting benchmarks should balance tasks with varying output lengths.
We present a new dataset, OLIVE, which simulates user instructions in the wild and presents challenges dissimilar to all datasets we tested.
arXiv Detail & Related papers (2024-04-03T02:40:35Z) - Constructing Multilingual Visual-Text Datasets Revealing Visual Multilingual Ability of Vision Language Models [25.088717058818528]
We introduce nine vision-and-language (VL) tasks and construct multilingual visual-text datasets in four languages: English, Japanese, Swahili, and Urdu.
Our work is the first to conduct such analyses in Swahili and Urdu. It also introduces rationales into the VL analysis, which played a vital role in the evaluation.
arXiv Detail & Related papers (2024-03-29T10:53:07Z) - ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding [1.9906814758497542]
ICU, Image Caption Understanding, divides a V&L task into two stages: a V&L model performs image captioning in English, and a multilingual language model (mLM) takes the caption as the alt text and performs cross-lingual language understanding.
We show that ICU can achieve new state-of-the-art results for five languages, and comparable results for the rest.
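To make the two-stage split concrete, here is a schematic sketch of the ICU idea under stated assumptions: both stages are hypothetical stubs standing in for the paper's English captioner and multilingual language model, and the toy scoring function is only illustrative.

```python
# Schematic of the two-stage ICU idea described above. The captioner and the
# multilingual LM are hypothetical stubs, not the paper's actual models.
from typing import List

def caption_image_en(image_path: str) -> str:
    """Stage 1 (hypothetical stub): an English V&L model captions the image."""
    return "a plate of dumplings on a wooden table"  # placeholder output

def multilingual_understanding(caption_en: str, query: str, candidates: List[str]) -> str:
    """Stage 2 (hypothetical stub): a multilingual LM treats the English caption
    as 'alt text' and answers a query posed in another language."""
    # Toy lexical-overlap score; a real mLM would condition on `query` as well.
    def score(c: str) -> int:
        return len(set(c) & set(caption_en))
    return max(candidates, key=score)

caption = caption_image_en("photo.jpg")  # English caption from the V&L model
answer = multilingual_understanding(caption, query="图片里是什么食物？",
                                    candidates=["dumplings", "pizza", "sushi"])
print(caption, "->", answer)
```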
arXiv Detail & Related papers (2023-10-19T07:11:48Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
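The summary above only names UC2's two pre-training tasks. As a loose illustration of the general idea behind a translation-augmented masked objective (not UC2's actual implementation, whose details are not given here), the sketch below builds one toy training example: a caption and its machine translation with random tokens masked, paired with placeholder image-region features.

```python
# Rough sketch of one translation-augmented masked-LM training example,
# suggested only by the task names in the summary above; the real UC2
# objectives differ in detail and are not reproduced here.
import random

MASK = "[MASK]"

def mask_tokens(tokens, prob=0.15, rng=random.Random(0)):
    """Randomly replace tokens with [MASK]; return masked tokens and targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < prob:
            masked.append(MASK)
            targets[i] = tok  # the model must recover the original token
        else:
            masked.append(tok)
    return masked, targets

caption_en = "a red lantern hangs over the street".split()
caption_zh = list("一盏红灯笼挂在街道上")   # toy character-level tokenization
region_feats = [[0.1] * 4] * 3              # placeholder image-region features

en_masked, en_tgt = mask_tokens(caption_en)
zh_masked, zh_tgt = mask_tokens(caption_zh)
example = {
    "regions": region_feats,
    "input": en_masked + ["[SEP]"] + zh_masked,
    # shift the Chinese target positions past the English tokens and [SEP]
    "targets": {**en_tgt, **{i + len(en_masked) + 1: t for i, t in zh_tgt.items()}},
}
print(example["input"])
```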