ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
- URL: http://arxiv.org/abs/2511.04479v2
- Date: Fri, 07 Nov 2025 04:50:48 GMT
- Title: ThaiOCRBench: A Task-Diverse Benchmark for Vision-Language Understanding in Thai
- Authors: Surapon Nonesung, Teetouch Jaknamon, Sirinya Chaiophat, Natapong Nitarach, Chanakan Wittayasakpan, Warit Sirichotedumrong, Adisai Na-Thalang, Kunat Pipatanakul
- Abstract summary: ThaiOCRBench is the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content.
- Score: 2.4295338216682456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present ThaiOCRBench, the first comprehensive benchmark for evaluating vision-language models (VLMs) on Thai text-rich visual understanding tasks. Despite recent progress in multimodal modeling, existing benchmarks predominantly focus on high-resource languages, leaving Thai underrepresented, especially in tasks requiring document structure understanding. ThaiOCRBench addresses this gap by offering a diverse, human-annotated dataset comprising 2,808 samples across 13 task categories. We evaluate a wide range of state-of-the-art VLMs in a zero-shot setting, spanning both proprietary and open-source systems. Results show a significant performance gap, with proprietary models (e.g., Gemini 2.5 Pro) outperforming open-source counterparts. Notably, fine-grained text recognition and handwritten content extraction exhibit the steepest performance drops among open-source models. Through detailed error analysis, we identify key challenges such as language bias, structural mismatch, and hallucinated content. ThaiOCRBench provides a standardized framework for assessing VLMs in low-resource, script-complex settings, and provides actionable insights for improving Thai-language document understanding.
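The abstract describes a zero-shot evaluation protocol: each of the 2,808 human-annotated samples is paired with an instruction, sent to a VLM, and scored within one of 13 task categories. As a rough illustration only, the sketch below shows what such a loop could look like; the sample field names, the query_vlm helper, and the exact-match scorer are hypothetical placeholders, not the benchmark's released schema, evaluation code, or metrics.

```python
# Minimal sketch of a zero-shot evaluation loop over a text-rich VLM benchmark
# such as ThaiOCRBench. Field names ("image", "instruction", "reference", "task"),
# query_vlm(), and the exact-match scorer are assumptions for illustration only.
from collections import defaultdict

def query_vlm(image_path: str, instruction: str) -> str:
    """Placeholder: send one image plus a Thai instruction to a VLM and return its answer."""
    raise NotImplementedError("plug in a proprietary or open-source VLM here")

def exact_match(prediction: str, reference: str) -> float:
    """Simplest possible scorer; real benchmarks often use ANLS, CER, or F1 instead."""
    return float(prediction.strip() == reference.strip())

def evaluate(samples: list[dict]) -> dict[str, float]:
    """Average a per-sample score within each task category of the benchmark."""
    per_task = defaultdict(list)
    for sample in samples:  # e.g. 2,808 samples spanning 13 task categories
        prediction = query_vlm(sample["image"], sample["instruction"])
        per_task[sample["task"]].append(exact_match(prediction, sample["reference"]))
    return {task: sum(scores) / len(scores) for task, scores in per_task.items()}
```

Per-task averages computed this way can be compared across proprietary and open-source models, which is how gaps in fine-grained text recognition or handwritten content extraction become visible.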
Related papers
- Multimodal Evaluation of Russian-language Architectures [88.00147763684451]
We introduce Mera Multi, an open multimodal evaluation framework for Russian-language architectures. The benchmark is instruction-based and encompasses default text, image, audio, and video modalities. Mera Multi provides a replicable methodology for constructing multimodal benchmarks in typologically diverse languages.
arXiv Detail & Related papers (2025-11-19T15:43:53Z) - FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model [11.423111315561151]
FG-CLIP 2 is a vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling. We present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification.
arXiv Detail & Related papers (2025-10-13T02:32:07Z) - Decoding Memes: Benchmarking Narrative Role Classification across Multilingual and Multimodal Models [26.91963265869296]
This work investigates the challenging task of identifying narrative roles in Internet memes. It builds on an annotated dataset originally skewed toward the 'Other' class. Comprehensive lexical and structural analyses highlight the nuanced, culture-specific, and context-rich language used in real memes.
arXiv Detail & Related papers (2025-06-29T07:12:11Z) - Towards Explainable Bilingual Multimodal Misinformation Detection and Localization [64.37162720126194]
BiMi is a framework that jointly performs region-level localization, cross-modal and cross-lingual consistency detection, and natural language explanation for misinformation analysis. BiMiBench is a benchmark constructed by systematically editing real news images and subtitles. BiMi outperforms strong baselines by up to +8.9 in classification accuracy, +15.9 in localization accuracy, and +2.5 in explanation BERTScore.
arXiv Detail & Related papers (2025-06-28T15:43:06Z) - Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation [45.551223552275424]
Vision-Language Translation is a challenging task that requires accurately recognizing multilingual text embedded in images. We present a comprehensive study of VLT from three key perspectives: data quality, model architecture, and evaluation metrics.
arXiv Detail & Related papers (2025-06-13T14:23:38Z) - A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning [27.350370419751385]
Remote Sensing Image Captioning (RSIC) is a cross-modal field bridging vision and language, aimed at automatically generating natural language descriptions of features and scenes in remote sensing imagery. Two critical challenges persist: the scarcity of non-English descriptive datasets and the lack of multilingual capability evaluation for models. This paper introduces and analyzes BRSIC, a comprehensive bilingual dataset that enriches three established English RSIC datasets with Chinese descriptions, encompassing 13,634 images paired with 68,170 bilingual captions.
arXiv Detail & Related papers (2025-03-06T16:31:34Z) - Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora (a generic sketch of this style of metric appears after this list).
arXiv Detail & Related papers (2024-02-16T13:53:26Z) - OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z) - IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
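The precision-and-recall entry above (Exploring Precision and Recall to assess the quality and diversity of LLMs) evaluates generated text without aligned corpora. The sketch below is a generic, assumption-laden illustration of that family of metrics rather than the cited paper's exact procedure: embed generated and reference texts, then read precision as how much of the generated distribution falls within the reference support (quality) and recall as the reverse (diversity). The k-NN construction and the random placeholder embeddings are assumptions.

```python
# Generic sketch of support-based precision/recall over text embeddings.
# Illustrative only; in practice the random vectors below would be replaced
# by real embeddings of generated and reference texts.
import numpy as np

def knn_radius(points: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each point to its k-th nearest neighbour within the same set."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return np.sort(dists, axis=1)[:, k]  # column 0 is the self-distance of 0

def coverage(queries: np.ndarray, support: np.ndarray, radii: np.ndarray) -> float:
    """Fraction of query points lying inside at least one support point's k-NN ball."""
    dists = np.linalg.norm(queries[:, None, :] - support[None, :, :], axis=-1)
    return float(np.mean(np.any(dists <= radii[None, :], axis=1)))

def precision_recall(generated: np.ndarray, reference: np.ndarray, k: int = 3):
    precision = coverage(generated, reference, knn_radius(reference, k))  # quality
    recall = coverage(reference, generated, knn_radius(generated, k))     # diversity
    return precision, recall

# Placeholder usage with random "embeddings"
rng = np.random.default_rng(0)
gen_emb, ref_emb = rng.normal(size=(200, 64)), rng.normal(size=(200, 64))
print(precision_recall(gen_emb, ref_emb))
```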
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.