Related papers: VisTW: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan

URL: http://arxiv.org/abs/2503.10427v2
Date: Sat, 15 Mar 2025 01:32:58 GMT
Title: VisTW: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan
Authors: Zhi Rui Tam, Ya-Ting Pai, Yen-Wei Lee, Yun-Nung Chen
Abstract summary: This paper proposes a comprehensive evaluation benchmark for Visual Language Models (VLM) in Traditional Chinese. Our evaluation suite, the first of its kind, contains two complementary components: VisTW-MCQ and VisTW-Dialogue.
Score: 20.92636353621876
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we propose a comprehensive evaluation benchmark for Visual Language Models (VLM) in Traditional Chinese. Our evaluation suite, the first of its kind, contains two complementary components: (1) VisTW-MCQ, a collection of manually curated exam multi-choice questions from 21 academic subjects designed to test the broad knowledge and reasoning capabilities of VLMs; and (2) VisTW-Dialogue, an open dialogue benchmark comprising 131 image-question pairs manually created to evaluate VLMs' ability in free-form dialogue generation within Taiwanese cultural contexts. These benchmarks address a critical gap in the evaluation landscape, where existing benchmarks predominantly focus on English or Simplified Chinese, neglecting the unique linguistic and cultural aspects of Traditional Chinese used in regions like Taiwan and Hong Kong. Our analysis reveals significant performance differences across various VLMs and highlights specific challenges in processing Traditional Chinese visual content.

Related papers

Fùxì: A Benchmark for Evaluating Language Models on Ancient Chinese Text Understanding and Generation [20.87296508045343]
We introduce Fuxi, a comprehensive benchmark that evaluates both understanding and generation capabilities across 21 diverse tasks. We reveal significant performance gaps between understanding and generation tasks, with models achieving promising results in comprehension but struggling considerably in generation tasks. Our findings highlight the current limitations in ancient Chinese text processing and provide insights for future model development.
arXiv Detail & Related papers (2025-03-20T04:26:40Z)
All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark [74.4821011648997]
MAIA is a benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos. It evaluates Vision Language Models (VLMs) on two aligned tasks. It considers twelve reasoning categories that aim to disentangle language and vision relations.
arXiv Detail & Related papers (2025-02-24T09:25:51Z)
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities [27.940469021840745]
We present an evaluation protocol to assess the spatial reasoning capabilities of vision-language models (VLMs). Despite some alignment with English conventions in resolving ambiguities, our experiments reveal significant shortcomings of VLMs. With a growing effort to align vision-language models with human cognitive intuitions, we call for more attention to the ambiguous nature and cross-cultural diversity of spatial reasoning.
arXiv Detail & Related papers (2024-10-22T19:39:15Z)
Representing the Under-Represented: Cultural and Core Capability Benchmarks for Developing Thai Large Language Models [8.746788828655356]
The rapid advancement of large language models (LLMs) has highlighted the need for robust evaluation frameworks. We propose two key benchmarks: Thai-H6 and the Thai Cultural and Linguistic Intelligence Benchmark (ThaiCLI).
arXiv Detail & Related papers (2024-10-07T07:14:37Z)
CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision-Language Understanding Evaluation benchmark dataset. The selection of object categories and images is entirely driven by Chinese native speakers. We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z)
CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures. It includes culturally driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, for a total of 10k questions. We benchmark several Multimodal Large Language Models (MLLMs) on CVQA and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning [0.0]
This research introduces a collection of Winograd schemas in Thai, a novel dataset designed to evaluate commonsense reasoning capabilities in the context of the Thai language. We evaluate the performance of popular large language models on this benchmark, revealing their strengths and limitations and providing insights into the current state of the art.
arXiv Detail & Related papers (2024-05-28T17:14:02Z)
FoundaBench: Evaluating Chinese Fundamental Knowledge Capabilities of Large Language Models [64.11333762954283]
This paper introduces FoundaBench, a pioneering benchmark designed to rigorously evaluate the fundamental knowledge capabilities of Chinese LLMs. We present an extensive evaluation of 12 state-of-the-art LLMs using FoundaBench, employing both traditional assessment methods and our CircularEval protocol to mitigate potential biases in model responses. Our results highlight the superior performance of models pre-trained on Chinese corpora, and reveal a significant disparity between models' reasoning and memory recall capabilities.
arXiv Detail & Related papers (2024-04-29T01:49:07Z)
Measuring Taiwanese Mandarin Language Understanding [24.581360653015423]
We present TMLU, a holistic evaluation suite tailored for assessing the advanced knowledge and reasoning capability of large language models (LLMs). TMLU consists of an array of 37 subjects across social science, STEM, humanities, Taiwan-specific content, and others, ranging from middle school to professional levels.
arXiv Detail & Related papers (2024-03-29T13:56:21Z)
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning. We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM). Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)