ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
- URL: http://arxiv.org/abs/2410.14132v2
- Date: Thu, 24 Oct 2024 03:53:35 GMT
- Title: ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
- Authors: Nghia Hieu Nguyen, Tho Thanh Quan, Ngan Luu-Thuy Nguyen
- Abstract summary: The main challenge of text-based VQA is exploiting the meaning and information from scene texts.
Recent studies tackled this challenge by considering the spatial information of scene texts in images.
We introduce a novel method that effectively exploits the information from scene texts written in Vietnamese.
- Score: 0.5803309695504829
- License:
- Abstract: Text-based VQA is a challenging task that requires machines to use scene texts in given images to yield the most appropriate answer for the given question. The main challenge of text-based VQA is exploiting the meaning and information from scene texts. Recent studies tackled this challenge by considering the spatial information of scene texts in images via embedding 2D coordinates of their bounding boxes. In this study, we follow the definition of meaning from linguistics to introduce a novel method that effectively exploits the information from scene texts written in Vietnamese. Experimental results show that our proposed method obtains state-of-the-art results on two large-scale Vietnamese Text-based VQA datasets. The implementation can be found at this link.
Related papers
- Scene-Text Grounding for Text-Based Video Question Answering [97.1112579979614]
Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and reliance on scene-text recognition.
We study Grounded TextVideoQA by forcing models to answer questions and interpret relevant scene-text regions.
arXiv Detail & Related papers (2024-09-22T05:13:11Z) - Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z) - ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images [1.2529442734851663]
We introduce the first large-scale Vietnamese dataset dedicated to understanding text appearing in images.
We uncover the significance of the order in which tokens in OCR text are processed and selected to formulate answers.
arXiv Detail & Related papers (2024-04-16T15:28:30Z) - Orientation-Independent Chinese Text Recognition in Scene Images [61.34060587461462]
We make the first attempt to extract orientation-independent visual features by disentangling content and orientation information of text images.
Specifically, we introduce a Character Image Reconstruction Network (CIRN) to recover corresponding printed character images with disentangled content and orientation information.
arXiv Detail & Related papers (2023-09-03T05:30:21Z) - Story Visualization by Online Text Augmentation with Context Memory [64.86944645907771]
We propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation.
The proposed method significantly outperforms the state of the art on various metrics, including FID, character F1, frame accuracy, BLEU-2/3, and R-precision.
arXiv Detail & Related papers (2023-08-15T05:08:12Z) - Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation [1.9085074258303771]
We study the task of "visually" translating scene text from a source language to a target language.
Visual translation involves not just the recognition and translation of scene text but also the generation of the translated image.
We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis.
arXiv Detail & Related papers (2023-08-06T05:23:25Z) - TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation [55.83319599681002]
Text-VQA aims at answering questions that require understanding the textual cues in an image.
We develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the existing rich text available in the scene context of each image.
arXiv Detail & Related papers (2022-08-03T02:18:09Z) - ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval [66.66400551173619]
We propose a full transformer architecture to unify cross-modal retrieval scenarios in a single Vision and Scene Text Aggregation (ViSTA) framework.
We develop dual contrastive learning losses to embed both image-text pairs and fusion-text pairs into a common cross-modal space.
Experimental results show that ViSTA outperforms other methods by at least 8.4% at Recall@1 for the scene-text-aware retrieval task.
arXiv Detail & Related papers (2022-03-31T03:40:21Z) - VieSum: How Robust Are Transformer-based Models on Vietnamese Summarization? [1.1379578593538398]
We investigate the robustness of transformer-based encoder-decoder architectures for Vietnamese abstractive summarization.
We validate the performance of the methods on two Vietnamese datasets.
arXiv Detail & Related papers (2021-10-08T17:10:31Z) - RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering [14.498144268367541]
We propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA.
We evaluate our RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies for exploring the reasons behind RUArt's effectiveness.
arXiv Detail & Related papers (2020-10-24T15:37:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.