VisTW: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan
- URL: http://arxiv.org/abs/2503.10427v2
- Date: Sat, 15 Mar 2025 01:32:58 GMT
- Title: VisTW: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan
- Authors: Zhi Rui Tam, Ya-Ting Pai, Yen-Wei Lee, Yun-Nung Chen,
- Abstract summary: This paper proposes a comprehensive evaluation benchmark for Visual Language Models (VLM) in Traditional Chinese.<n>Our evaluation suite, the first of its kind, contains two complementary components: VisTW-MCQ and VisTW-Dialogue.
- Score: 20.92636353621876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a comprehensive evaluation benchmark for Visual Language Models (VLM) in Traditional Chinese. Our evaluation suite, the first of its kind, contains two complementary components: (1) VisTW-MCQ, a collection of manually curated exam multi-choice questions from 21 academic subjects designed to test the broad knowledge and reasoning capabilities of VLMs; and (2) VisTW-Dialogue, an open dialogue benchmark comprising 131 image-question pairs manually created to evaluate VLMs' ability in free-form dialogue generation within Taiwanese cultural contexts. These benchmarks address a critical gap in the evaluation landscape, where existing benchmarks predominantly focus on English or Simplified Chinese, neglecting the unique linguistic and cultural aspects of Traditional Chinese used in regions like Taiwan and Hong Kong. Our analysis reveals significant performance differences across various VLMs and highlights specific challenges in processing Traditional Chinese visual content.
Related papers
- ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai [2.4295338216682456]
ThaiOCRBench is the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks.<n>We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems.<n>Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content.
arXiv Detail & Related papers (2025-11-06T15:57:39Z) - BLEnD-Vis: Benchmarking Multimodal Cultural Understanding in Vision Language Models [54.16874020794336]
We introduce BLEnD-Vis, a benchmark designed to evaluate the robustness of everyday cultural knowledge in vision-language models (VLMs)<n> BLEnD-Vis constructs 313 culturally grounded question templates spanning 16 regions and generates three aligned multiple-choice formats.<n>The resulting benchmark comprises 4,916 images and over 21,000 multiple-choice question (MCQ) instances, validated through human annotation.
arXiv Detail & Related papers (2025-10-13T09:10:05Z) - CDTP: A Large-Scale Chinese Data-Text Pair Dataset for Comprehensive Evaluation of Chinese LLMs [71.01843542502438]
We present a comprehensive benchmark for evaluating Chinese Large Language Models (CB-ECLLM)<n>CB-ECLLM is based on the newly constructed Chinese Data-Text Pair (CDTP) dataset.<n>CDTP comprises over 7 million aligned text pairs, each consisting of unstructured text coupled with one or more corresponding triples, alongside a total of 15 million triples spanning four critical domains.
arXiv Detail & Related papers (2025-10-07T15:33:52Z) - MMA-ASIA: A Multilingual and Multimodal Alignment Framework for Culturally-Grounded Evaluation [91.22008265721952]
MMA-ASIA centers on a human-curated, multilingual, and multimodally aligned benchmark covering 8 Asian countries and 10 languages.<n>This is the first dataset aligned at the input level across three modalities: text, image (visual question answering), and speech.<n>We propose a five-dimensional evaluation protocol that measures: (i) cultural-awareness disparities across countries, (ii) cross-lingual consistency, (iii) cross-modal consistency, (iv) cultural knowledge generalization, and (v) grounding validity.
arXiv Detail & Related papers (2025-10-07T14:12:12Z) - Disentangling Language and Culture for Evaluating Multilingual Large Language Models [48.06219053598005]
This paper introduces a Dual Evaluation Framework to comprehensively assess the multilingual capabilities of LLMs.<n>By decomposing the evaluation along the dimensions of linguistic medium and cultural context, this framework enables a nuanced analysis of LLMs' ability to process questions cross-lingually.
arXiv Detail & Related papers (2025-05-30T14:25:45Z) - Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation [20.87296508045343]
We introduce Fuxi, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks.
We reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks.
Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development.
arXiv Detail & Related papers (2025-03-20T04:26:40Z) - All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark [74.4821011648997]
MAIA is a benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos.<n>It evaluates Vision Language Models (VLMs) on two aligned tasks.<n>It considers twelve reasoning categories that aim to disentangle language and vision relations.
arXiv Detail & Related papers (2025-02-24T09:25:51Z) - Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities [27.940469021840745]
We present an evaluation protocol to assess the spatial reasoning capabilities of vision-language models (VLMs)
Despite some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs.
With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.
arXiv Detail & Related papers (2024-10-22T19:39:15Z) - Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models [8.746788828655356]
The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks.
We propose two key benchmarks: Thai-H6 and Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI)
arXiv Detail & Related papers (2024-10-07T07:14:37Z) - CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision- Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z) - CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z) - Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning [0.0]
This research introduces a collection of Winograds in Thai, a novel dataset designed to evaluate commonsense reasoning capabilities in the context of the Thai language.<n>We evaluate the performance of popular large language models on this benchmark, revealing their strengths, limitations, and providing insights into the current state-of-the-art.
arXiv Detail & Related papers (2024-05-28T17:14:02Z) - FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models [64.11333762954283]
This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs.
We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses.
Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
arXiv Detail & Related papers (2024-04-29T01:49:07Z) - Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suit tailored for assessing the advanced knowledge and reasoning capability in large language models (LLMs)
TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM)
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.