Related papers: ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images

URL: http://arxiv.org/abs/2404.10652v1
Date: Tue, 16 Apr 2024 15:28:30 GMT
Title: ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Authors: Quan Van Nguyen, Dan Quang Tran, Huy Quang Pham, Thang Kien-Bao Nguyen, Nghia Hieu Nguyen, Kiet Van Nguyen, Ngan Luu-Thuy Nguyen,
Abstract summary: We introduce the first large-scale dataset in Vietnamese specializing in the ability to understand text appearing in images. We uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers.
Score: 1.2529442734851663
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Visual Question Answering (VQA) is a complicated task that requires the capability of simultaneously processing natural language and images. Initially, this task was researched, focusing on methods to help machines understand objects and scene contexts in images. However, some text appearing in the image that carries explicit information about the full content of the image is not mentioned. Along with the continuous development of the AI era, there have been many studies on the reading comprehension ability of VQA models in the world. As a developing country, conditions are still limited, and this task is still open in Vietnam. Therefore, we introduce the first large-scale dataset in Vietnamese specializing in the ability to understand text appearing in images, we call it ViTextVQA (\textbf{Vi}etnamese \textbf{Text}-based \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering dataset) which contains \textbf{over 16,000} images and \textbf{over 50,000} questions with answers. Through meticulous experiments with various state-of-the-art models, we uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers. This finding helped us significantly improve the performance of the baseline models on the ViTextVQA dataset. Our dataset is available at this \href{https://github.com/minhquan6203/ViTextVQA-Dataset}{link} for research purposes.

Related papers

VCR: A Task for Pixel-Level Complex Reasoning in Vision Language Models via Restoring Occluded Text [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.<n>This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z)
Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension. We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images. We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [58.92057773071854]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages. MTVQA is the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z)
ViOCRVQA: Novel Benchmark Dataset and Vision Reader for Visual Question Answering by Understanding Vietnamese Text in Images [1.2529442734851663]
We introduce a novel dataset, ViOCRVQA (Vietnamese Optical Character Recognition - Visual Question Answering dataset), consisting of 28,000+ images and 120,000+ question-answer pairs. In this dataset, all the images contain text and questions about the information relevant to the text in the images. We deploy ideas from state-of-the-art methods proposed for English to conduct experiments on our dataset, revealing the challenges and difficulties inherent in a Vietnamese dataset.
arXiv Detail & Related papers (2024-04-29T03:17:47Z)
Making the V in Text-VQA Matter [1.2962828085662563]
Text-based VQA aims at answering questions by reading the text present in the images. Recent studies have shown that the question-answer pairs in the dataset are more focused on the text present in the image. The models trained on this dataset predict biased answers due to the lack of understanding of visual context.
arXiv Detail & Related papers (2023-08-01T05:28:13Z)
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering [86.38098280689027]
We introduce an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA) We present a comprehensive evaluation of existing text-to-image models using a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.)
arXiv Detail & Related papers (2023-03-21T14:41:02Z)
Look, Read and Ask: Learning to Ask Questions by Reading Text in Images [3.3972119795940525]
We present a novel problem of text-based visual question generation or TextVQG. To address TextVQG, we present an OCR consistent visual question generation model that Looks into the visual content, Reads the scene text, and Asks a relevant and meaningful natural language question.
arXiv Detail & Related papers (2022-11-23T13:52:46Z)
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image. We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z)
Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language. We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs. We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
Localize, Group, and Select: Boosting Text-VQA by Scene Text Modeling [12.233796960280944]
Text-VQA (Visual Question Answering) aims at question answering through reading text information in images. LOGOS is a novel model which attempts to tackle this problem from multiple aspects.
arXiv Detail & Related papers (2021-08-20T01:31:51Z)
VisualMRC: Machine Reading Comprehension on Document Images [4.057968826847943]
Given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language. VisualMRC focuses more on developing natural language understanding and generation abilities. It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from multiple domains of webpages.
arXiv Detail & Related papers (2021-01-27T09:03:06Z)
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. TAP explicitly incorporates scene text (generated from OCR engines) in pre-training. Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.