Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark
- URL: http://arxiv.org/abs/2512.22218v1
- Date: Mon, 22 Dec 2025 13:39:40 GMT
- Title: Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark
- Authors: Hieu Minh Nguyen, Tam Le-Thanh Dang, Kiet Van Nguyen
- Abstract summary: ViSignVQA is the first large-scale Vietnamese dataset designed for signboard-oriented VQA. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards.
- Score: 5.3220011447194215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding signboard text in natural scenes is essential for real-world applications of Visual Question Answering (VQA), yet it remains underexplored, particularly in low-resource languages. We introduce ViSignVQA, the first large-scale Vietnamese dataset designed for signboard-oriented VQA, comprising 10,762 images and 25,573 question-answer pairs. The dataset captures the diverse linguistic, cultural, and visual characteristics of Vietnamese signboards, including bilingual text, informal phrasing, and visual elements such as color and layout. To benchmark this task, we adapt state-of-the-art VQA models (e.g., BLIP-2, LaTr, PreSTU, and SaL) by integrating a Vietnamese OCR model (SwinTextSpotter) and a Vietnamese pretrained language model (ViT5). The experimental results highlight the significant role of OCR-enhanced context, with F1-score improvements of up to 209% when OCR text is appended to questions. Additionally, we propose a multi-agent VQA framework that combines perception and reasoning agents with GPT-4, achieving 75.98% accuracy via majority voting. Ours is the first large-scale multimodal dataset for Vietnamese signboard understanding, underscoring the importance of domain-specific resources for text-based VQA in low-resource languages. ViSignVQA serves as a benchmark that captures real-world scene-text characteristics and supports the development and evaluation of OCR-integrated VQA models in Vietnamese.
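Two mechanisms described in the abstract are simple enough to illustrate directly: appending the OCR-extracted signboard text to the question before it reaches the VQA model, and aggregating the answers of several agents by majority vote. The sketch below shows both in plain Python; the function names, the `[OCR]` separator, and the example strings are illustrative assumptions, not the authors' actual implementation (the paper pairs SwinTextSpotter for OCR with ViT5 and GPT-4-based agents).

```python
# Minimal sketch of two ideas from the abstract. All names here are
# illustrative placeholders, not code from the ViSignVQA paper.
from collections import Counter


def build_ocr_augmented_question(question: str, ocr_tokens: list[str]) -> str:
    """Append OCR-extracted signboard text to the question.

    This is the "OCR-enhanced context" the paper reports large
    F1-score gains from; the [OCR] separator is an assumption.
    """
    return f"{question} [OCR] {' '.join(ocr_tokens)}"


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the earliest one."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]


# Hypothetical usage: several agents (e.g., perception, reasoning,
# GPT-4) each propose an answer for one image/question pair.
question = "Quán này bán gì?"  # "What does this shop sell?"
ocr_tokens = ["PHỞ", "BÒ", "GIA", "TRUYỀN"]
print(build_ocr_augmented_question(question, ocr_tokens))
print(majority_vote(["phở bò", "phở bò", "bún bò"]))  # -> "phở bò"
```

The abstract does not specify how ties between agents are resolved or whether votes are weighted, so the first-occurrence tie-break above is a placeholder choice.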
Related papers
- KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts [5.689962668710347]
KRETA is a benchmark for Korean Reading and rEasoning in Text-rich VQA Attuned to diverse visual contexts.
KRETA facilitates an in-depth evaluation of both visual text understanding and reasoning capabilities, while also supporting a multifaceted assessment.
We introduce a semi-automated VQA generation pipeline specifically optimized for text-rich settings, leveraging refined stepwise image decomposition.
arXiv Detail & Related papers (2025-08-27T15:01:02Z)
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [57.30218240464696]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z)
- ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images [1.2529442734851663]
We introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs.
In this dataset, all the images contain text and questions about the information relevant to the text in the images.
We apply ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges inherent in a Vietnamese dataset.
arXiv Detail & Related papers (2024-04-29T03:17:47Z)
- ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images [1.2529442734851663]
Visual Question Answering (VQA) is a complicated task that requires processing natural language and images simultaneously.
We introduce the first large-scale dataset in Vietnamese specializing in the ability to understand scene text.
arXiv Detail & Related papers (2024-04-16T15:28:30Z)
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document [60.01330653769726]
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks.
By adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions.
By expanding its capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability.
arXiv Detail & Related papers (2024-03-07T13:16:24Z)
- VlogQA: Task, Dataset, and Baseline Models for Vietnamese Spoken-Based Machine Reading Comprehension [1.3942150186842373]
This paper presents the development process of a Vietnamese spoken language corpus for machine reading comprehension tasks.
Existing MRC corpora in Vietnamese mainly focus on formal written documents such as Wikipedia articles, online newspapers, or textbooks.
In contrast, VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube.
arXiv Detail & Related papers (2024-02-05T00:54:40Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)