MaXM: Towards Multilingual Visual Question Answering
- URL: http://arxiv.org/abs/2209.05401v3
- Date: Tue, 24 Oct 2023 05:59:52 GMT
- Title: MaXM: Towards Multilingual Visual Question Answering
- Authors: Soravit Changpinyo, Linting Xue, Michal Yarom, Ashish V. Thapliyal,
Idan Szpektor, Julien Amelot, Xi Chen, Radu Soricut
- Abstract summary: We propose scalable solutions to multilingual visual question answering (mVQA) on both data and modeling fronts.
We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers.
Then, we apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaXM, a test-only VQA benchmark in 7 diverse languages.
- Score: 28.268881608141303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual Question Answering (VQA) has been primarily studied through the lens
of the English language. Yet, tackling VQA in other languages in the same
manner would require a considerable amount of resources. In this paper, we
propose scalable solutions to multilingual visual question answering (mVQA) on
both data and modeling fronts. We first propose a translation-based framework
for mVQA data generation that requires much less human annotation effort than
the conventional approach of directly collecting questions and answers. Then,
we apply our framework to the multilingual captions in the Crossmodal-3600
dataset and develop an efficient annotation protocol to create MaXM, a
test-only VQA benchmark in 7 diverse languages. Finally, we develop a simple,
lightweight, and effective approach as well as benchmark state-of-the-art
English and multilingual VQA models. We hope that our benchmark encourages
further research on mVQA.
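As a concrete illustration of the translation-based data-generation idea described above, here is a minimal, hypothetical Python sketch: English question-answer pairs are machine-translated into each target language, and human annotators would then verify or edit a sample of the output rather than authoring questions from scratch. The `VQAExample` fields, the `mt_translate` placeholder, and the `generate_mvqa` helper are illustrative assumptions, not the paper's actual pipeline (which also handles answer generation and validation differently).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class VQAExample:
    """One VQA item; field names are illustrative, not the MaXM schema."""
    image_id: str
    question: str
    answer: str
    language: str


def mt_translate(text: str, target_lang: str) -> str:
    """Stand-in for any machine-translation backend (hypothetical)."""
    return f"[{target_lang}] {text}"  # replace with a real MT call


def generate_mvqa(
    english_examples: List[VQAExample],
    target_langs: List[str],
    mt: Callable[[str, str], str] = mt_translate,
) -> List[VQAExample]:
    """Translate English QA pairs into each target language.

    A human pass over (a sample of) the translated output is still needed,
    but it is far cheaper than collecting questions and answers directly.
    """
    generated: List[VQAExample] = []
    for ex in english_examples:
        for lang in target_langs:
            generated.append(
                VQAExample(
                    image_id=ex.image_id,
                    question=mt(ex.question, lang),
                    answer=mt(ex.answer, lang),
                    language=lang,
                )
            )
    return generated


if __name__ == "__main__":
    seed = [VQAExample("img_001", "What color is the bus?", "red", "en")]
    print(generate_mvqa(seed, ["fr", "hi", "zh"]))
```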
Related papers
- Towards Multilingual Audio-Visual Question Answering [1.3194391758295114]
We leverage machine translation and present two multilingual AVQA datasets for eight languages.
This avoids the extra human annotation effort of collecting questions and answers manually.
We introduce a suite of models namely MERA-L, MERA-C, MERA-T with varied model architectures to benchmark the proposed datasets.
arXiv Detail & Related papers (2024-06-13T14:18:56Z) - MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering [58.92057773071854]
We introduce MTVQA, the first benchmark featuring high-quality human expert annotations across 9 diverse languages.
arXiv Detail & Related papers (2024-05-20T12:35:01Z) - Language Guided Visual Question Answering: Elevate Your Multimodal
Language Model Using Knowledge-Enriched Prompts [54.072432123447854]
Visual question answering (VQA) is the task of answering questions about an image.
Answering the question requires commonsense knowledge, world knowledge, and reasoning about ideas and concepts not present in the image.
We propose a framework that uses language guidance (LG) in the form of rationales, image captions, scene graphs, etc., to answer questions more accurately.
arXiv Detail & Related papers (2023-10-31T03:54:11Z) - PAXQA: Generating Cross-lingual Question Answering Examples at Training
Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z) - Delving Deeper into Cross-lingual Visual Question Answering [115.16614806717341]
We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
arXiv Detail & Related papers (2022-02-15T18:22:18Z) - MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first multilingual FAQ dataset publicly available.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset; a minimal bi-encoder sketch appears after this list.
arXiv Detail & Related papers (2021-09-27T08:43:25Z) - xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
arXiv Detail & Related papers (2021-09-13T15:58:21Z) - Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in one resource-rich language, e.g., English, to other, less resource-rich languages.
arXiv Detail & Related papers (2021-02-20T03:52:08Z)