MUST-VQA: MUltilingual Scene-text VQA
- URL: http://arxiv.org/abs/2209.06730v1
- Date: Wed, 14 Sep 2022 15:37:56 GMT
- Title: MUST-VQA: MUltilingual Scene-text VQA
- Authors: Emanuele Vivoli, Ali Furkan Biten, Andres Mafla, Dimosthenis Karatzas, Lluis Gomez
- Abstract summary: We consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages.
We show the effectiveness of adapting multilingual language models to STVQA tasks.
- Score: 7.687215328455748
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a framework for Multilingual Scene Text Visual
Question Answering that deals with new languages in a zero-shot fashion.
Specifically, we consider the task of Scene Text Visual Question Answering
(STVQA) in which the question can be asked in different languages and is not
necessarily aligned with the scene-text language. Thus, we first introduce
MUST-VQA, a natural step towards a more generalized version of STVQA.
Accounting for this, we discuss two evaluation scenarios in the constrained
setting, namely IID and zero-shot, and we demonstrate that the models can
perform on par in the zero-shot setting. We further provide extensive
experimentation and show the effectiveness of adapting multilingual language
models to STVQA tasks.
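As a rough, hypothetical illustration of what adapting a multilingual language model to STVQA can look like, the Python sketch below embeds the same question asked in several languages with an off-the-shelf multilingual encoder (mBERT via Hugging Face Transformers). The model choice, the mean pooling, and the fusion step hinted at in the comments are assumptions made for illustration, not the architecture evaluated in the paper.

# Hypothetical sketch: a multilingual question encoder that can embed
# questions in languages never seen during STVQA training (zero-shot).
# Model choice and pooling are assumptions, not the paper's pipeline.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed stand-in encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

questions = {
    "en": "What does the sign say?",
    "es": "¿Qué dice el letrero?",
    "ca": "Què diu el rètol?",
}

with torch.no_grad():
    for lang, question in questions.items():
        batch = tokenizer(question, return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state   # (1, seq_len, 768)
        q_emb = hidden.mean(dim=1)                    # simple mean pooling
        # q_emb would then be fused with visual and OCR-token features by
        # the STVQA model; that part is omitted here.
        print(lang, tuple(q_emb.shape))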
Related papers
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [58.92057773071854]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z)
- Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [65.94419474119162]
ZS-A2T is a framework that translates the transformer attention of a given model into natural language without requiring any training.
We consider this in the context of Visual Question Answering (VQA).
Our framework allows the drop-in replacement of different guiding sources.
arXiv Detail & Related papers (2023-11-08T22:18:53Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Applying Multilingual Models to Question Answering (QA) [0.0]
We study the performance of monolingual and multilingual language models on the task of question-answering (QA) on three diverse languages: English, Finnish and Japanese.
We develop models for the tasks of (1) determining if a question is answerable given the context and (2) identifying the answer texts within the context using IOB tagging.
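As a hypothetical illustration of the IOB tagging step mentioned above, the sketch below converts a character-level answer span into B/I/O labels over whitespace tokens. Real systems tag subword tokens produced by a pretrained tokenizer, so this is a simplification rather than the authors' implementation.

# Minimal sketch: label each whitespace token as B (begins the answer),
# I (inside the answer) or O (outside), given a character-level span.
def iob_tags(context: str, answer_start: int, answer_end: int) -> list[tuple[str, str]]:
    tags, cursor, inside = [], 0, False
    for token in context.split():
        tok_start = context.index(token, cursor)
        tok_end = tok_start + len(token)
        cursor = tok_end
        if tok_end <= answer_start or tok_start >= answer_end:
            tags.append((token, "O"))
        elif not inside:
            tags.append((token, "B"))
            inside = True
        else:
            tags.append((token, "I"))
    return tags

context = "The Eiffel Tower is located in Paris France"
answer = "Paris France"
start = context.index(answer)
print(iob_tags(context, start, start + len(answer)))
# ... ('in', 'O'), ('Paris', 'B'), ('France', 'I')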
arXiv Detail & Related papers (2022-12-04T21:58:33Z)
- MaXM: Towards Multilingual Visual Question Answering [28.268881608141303]
We propose scalable solutions to multilingual visual question answering (mVQA) on both data and modeling fronts.
We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers.
Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages.
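As a hedged illustration of the translation-based data-generation idea, the sketch below machine-translates English question/answer pairs into French with an off-the-shelf MT model from Hugging Face. The model name and the flat record format are assumptions; the actual MaXM protocol builds on the multilingual captions of Crossmodal-3600 and the annotation steps described in the paper.

# Hypothetical sketch: generate multilingual VQA data by machine-translating
# English question/answer pairs while keeping the image reference fixed.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

english_qa = [
    {"image_id": "img_001", "question": "What color is the bus?", "answer": "red"},
]

french_qa = []
for ex in english_qa:
    q_fr = translator(ex["question"])[0]["translation_text"]
    a_fr = translator(ex["answer"])[0]["translation_text"]
    french_qa.append({"image_id": ex["image_id"], "question": q_fr, "answer": a_fr})

print(french_qa)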
arXiv Detail & Related papers (2022-09-12T16:53:37Z)
- Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z)
- LaTr: Layout-Aware Transformer for Scene-Text VQA [8.390314291424263]
We propose a novel architecture for Scene Text Visual Question Answering (STVQA).
We show that applying this pre-training scheme on scanned documents has certain advantages over using natural images.
Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes well beyond the training vocabulary.
arXiv Detail & Related papers (2021-12-23T12:41:26Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.