KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
- URL: http://arxiv.org/abs/2508.19944v2
- Date: Sun, 31 Aug 2025 10:33:09 GMT
- Title: KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
- Authors: Taebaek Hwang, Minseo Kim, Gisang Lee, Seonuk Kim, Hyunjun Eun,
- Abstract summary: KRETA is a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment. We introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition.
- Score: 5.689962668710347
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding and reasoning over text within visual contexts poses a significant challenge for Vision-Language Models (VLMs), given the complexity and diversity of real-world scenarios. To address this challenge, text-rich Visual Question Answering (VQA) datasets and benchmarks have emerged for high-resource languages like English. However, a critical gap persists for low-resource languages such as Korean, where the lack of comprehensive benchmarks hinders robust model evaluation and comparison. To bridge this gap, we introduce KRETA, a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts. KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment across 15 domains and 26 image types. Additionally, we introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition and a rigorous seven-metric evaluation protocol to ensure data quality. While KRETA is tailored for Korean, we hope our adaptable and extensible pipeline will facilitate the development of similar benchmarks in other languages, thereby accelerating multilingual VLM research. The code and dataset for KRETA are available at https://github.com/tabtoyou/KRETA.
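The abstract describes the generation pipeline only at a high level: stepwise image decomposition followed by a seven-metric quality check in a semi-automated loop. As a rough, non-authoritative illustration of how such a metric-gated loop could be organized, the Python sketch below uses placeholder metric names, region/QA data structures, and a 0.8 acceptance threshold; all of these names and values are assumptions for illustration and are not taken from the actual KRETA repository at https://github.com/tabtoyou/KRETA.

```python
# Hypothetical sketch of a semi-automated, metric-gated VQA generation loop
# (decompose text-rich images -> generate QA per region -> filter by quality
# metrics). Names, metric labels, and the threshold are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Region:
    bbox: tuple    # (x0, y0, x1, y1) crop of a text-rich image
    ocr_text: str  # text extracted from the crop


@dataclass
class VQASample:
    question: str
    answer: str
    scores: Dict[str, float] = field(default_factory=dict)  # per-metric scores in [0, 1]


# Placeholder metric names: the paper reports a seven-metric protocol,
# but these exact labels are assumptions for illustration.
QUALITY_METRICS = [
    "readability", "answerability", "groundedness", "fluency",
    "difficulty", "relevance", "answer_correctness",
]


def decompose(regions: List[Region]) -> List[Region]:
    """Stepwise decomposition stand-in: keep only regions with recoverable text."""
    return [r for r in regions if r.ocr_text.strip()]


def generate_qa(region: Region) -> VQASample:
    """Stand-in for an LLM-based question generator over one region."""
    question = f"What text appears in the region at {region.bbox}?"
    return VQASample(question=question, answer=region.ocr_text)


def score(sample: VQASample) -> VQASample:
    """Stand-in scorer; a real pipeline would query judge models per metric."""
    sample.scores = {m: 1.0 if sample.answer else 0.0 for m in QUALITY_METRICS}
    return sample


def build_benchmark(regions: List[Region], threshold: float = 0.8) -> List[VQASample]:
    """Keep only samples that pass every quality metric."""
    kept = []
    for region in decompose(regions):
        sample = score(generate_qa(region))
        if all(v >= threshold for v in sample.scores.values()):
            kept.append(sample)
    return kept


if __name__ == "__main__":
    demo = [
        Region(bbox=(0, 0, 100, 40), ocr_text="영업시간 09:00-18:00"),  # signboard text
        Region(bbox=(0, 50, 100, 90), ocr_text=""),  # dropped during decomposition
    ]
    for s in build_benchmark(demo):
        print(s.question, "->", s.answer)
```

In a real pipeline the generator and scorer would be backed by VLM/LLM calls, with the "semi-automated" part coming from human review of accepted or borderline samples; the sketch only fixes the control flow.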
Related papers
- VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? [51.02924254085878]
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs. We introduce VISTA-Bench, a benchmark spanning multimodal perception, reasoning, and unimodal understanding domains.
arXiv Detail & Related papers (2026-02-04T17:48:55Z)
- Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark [5.3220011447194215]
ViSignVQA is the first large-scale Vietnamese dataset designed for signboard-oriented VQA. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards.
arXiv Detail & Related papers (2025-12-22T13:39:40Z)
- DRISHTIKON: Visual Grounding at Multiple Granularities in Documents [21.376466879737855]
DRISHTIKON is a multi-granular and multi-block visual grounding framework. Our approach integrates multilingual OCR, large language models, and a novel region matching algorithm to localize answer spans. Our findings pave the way for more robust and interpretable document understanding systems.
arXiv Detail & Related papers (2025-06-26T14:32:23Z)
- Evaluating Menu OCR and Translation: A Benchmark for Aligning Human and Automated Evaluations in Large Vision-Language Models [44.159383734605456]
We propose a specialized evaluation framework emphasizing the pivotal role of menu translation in cross-cultural communication. MOTBench requires LVLMs to accurately recognize and translate each dish on a menu, along with its price and unit items. Our benchmark comprises Chinese and English menus characterized by intricate layouts, a variety of fonts, and culturally specific elements across different languages, along with precise human annotations.
arXiv Detail & Related papers (2025-04-16T03:08:57Z)
- Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective [42.69954782425797]
Large Vision-Language Models (LVLMs) have shown promising reasoning capabilities on text-rich images from charts, tables, and documents. This raises the need to evaluate LVLM performance on cross-lingual text-rich visual inputs, where the language in the image differs from the language of the instructions. We introduce XT-VQA (Cross-Lingual Text-Rich Visual Question Answering), a benchmark designed to assess how LVLMs handle language inconsistency between image text and questions.
arXiv Detail & Related papers (2024-12-23T18:48:04Z)
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [57.30218240464696]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- A Benchmark for Chinese-English Scene Text Image Super-resolution [15.042152725255171]
Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from low-resolution (LR) input.
Most existing works focus on recovering English texts, which have relatively simple character structures.
We propose a real-world Chinese-English benchmark dataset, namely Real-CE, for the task of STISR.
arXiv Detail & Related papers (2023-08-07T02:57:48Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.