xGQA: Cross-Lingual Visual Question Answering
- URL: http://arxiv.org/abs/2109.06082v1
- Date: Mon, 13 Sep 2021 15:58:21 GMT
- Title: xGQA: Cross-Lingual Visual Question Answering
- Authors: Jonas Pfeiffer and Gregor Geigle and Aishwarya Kamath and Jan-Martin O. Steitz and Stefan Roth and Ivan Vulić and Iryna Gurevych
- Abstract summary: xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
- Score: 100.35229218735938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in multimodal vision and language modeling have predominantly
focused on the English language, mostly due to the lack of multilingual
multimodal datasets to steer modeling efforts. In this work, we address this
gap and provide xGQA, a new multilingual evaluation benchmark for the visual
question answering task. We extend the established English GQA dataset to 7
typologically diverse languages, enabling us to detect and explore crucial
challenges in cross-lingual visual question answering. We further propose new
adapter-based approaches to adapt multimodal transformer-based models to become
multilingual, and -- vice versa -- multilingual models to become multimodal.
Our proposed methods outperform current state-of-the-art multilingual
multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the
accuracy remains low across the board; a performance drop of around 38 accuracy
points in target languages showcases the difficulty of zero-shot cross-lingual
transfer for this task. Our results suggest that simple cross-lingual transfer
of multimodal models yields latent multilingual multimodal misalignment,
calling for more sophisticated methods for vision and multilingual language
modeling. The xGQA dataset is available online at:
https://github.com/Adapter-Hub/xGQA.
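The abstract mentions adapter-based approaches for making multimodal transformers multilingual (and multilingual models multimodal) but does not spell them out here. As an illustration only, the sketch below shows a generic bottleneck adapter of the kind such methods typically insert into a frozen transformer layer; the class name, sizes, and placement are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Illustrative bottleneck adapter (hypothetical sketch, not the paper's exact module).

    A small feed-forward block with a residual connection, inserted after a
    frozen transformer sublayer so that only the adapter parameters are trained
    for a new language or a new modality.
    """

    def __init__(self, hidden_size: int = 768, bottleneck_size: int = 96):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)  # project down
        self.up = nn.Linear(bottleneck_size, hidden_size)    # project back up
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone's representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))


# Usage: wrap a frozen backbone's hidden states with the adapter.
adapter = BottleneckAdapter(hidden_size=768)
hidden = torch.randn(2, 10, 768)  # (batch, sequence, hidden)
adapted = adapter(hidden)
print(adapted.shape)              # torch.Size([2, 10, 768])
```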
Related papers
- ColBERT-XM: A Modular Multi-Vector Representation Model for Zero-Shot Multilingual Information Retrieval [10.664434993386523]
Current approaches circumvent the lack of high-quality labeled data in non-English languages.
We present a novel modular dense retrieval model that learns from the rich data of a single high-resource language.
arXiv Detail & Related papers (2024-02-23T02:21:24Z)
- Meta-learning For Vision-and-language Cross-lingual Transfer [14.594704809280984]
We propose a novel meta-learning fine-tuning framework for vision-language models.
Our framework enables current pre-trained vision-language models (PVLMs) to adapt rapidly to new languages in vision-language scenarios.
Our method boosts the performance of current state-of-the-art PVLMs in both zero-shot and few-shot cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T07:51:42Z)
- Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing low-quality translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
- Improving the Cross-Lingual Generalisation in Visual Question Answering [40.86774711775718]
Multilingual vision-language pretrained models show poor cross-lingual generalisation when applied to non-English data.
In this work, we explore the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task.
We improve cross-lingual transfer with three strategies: (1) we introduce a linguistic prior objective that augments the cross-entropy loss with a similarity-based loss to guide the model during training; (2) we learn a task-specific subnetwork that improves cross-lingual generalisation and reduces variance without modifying the model; and (3) we augment training examples using synthetic code-mixing.
arXiv Detail & Related papers (2022-09-07T08:07:43Z)
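Strategy (1) above combines the usual cross-entropy objective with a similarity-based loss; the paper's exact formulation is not given in this summary, so the snippet below is only a hedged sketch of one way to add a cosine-similarity term between embeddings of parallel inputs. The function name, the weighting `alpha`, and the choice of embeddings are assumptions.

```python
import torch
import torch.nn.functional as F


def combined_loss(logits: torch.Tensor,
                  labels: torch.Tensor,
                  source_emb: torch.Tensor,
                  target_emb: torch.Tensor,
                  alpha: float = 0.1) -> torch.Tensor:
    """Hypothetical combination of cross-entropy with a similarity-based term.

    The similarity term encourages embeddings of parallel inputs (e.g., an
    English question and its translation) to stay close; `alpha` is an assumed
    weighting factor, not taken from the paper.
    """
    ce = F.cross_entropy(logits, labels)
    # 1 - cosine similarity -> 0 when the two embeddings are aligned.
    sim_penalty = 1.0 - F.cosine_similarity(source_emb, target_emb, dim=-1).mean()
    return ce + alpha * sim_penalty


# Toy usage with random tensors and a placeholder answer-vocabulary size.
num_answers = 1000
logits = torch.randn(4, num_answers)
labels = torch.randint(0, num_answers, (4,))
src = torch.randn(4, 768)
tgt = torch.randn(4, 768)
print(combined_loss(logits, labels, src, tgt))
```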
- Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation [20.33235443471006]
We propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student).
We also create a large-scale multilingual and code-mixed VQA dataset in eleven different language setups.
Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over the pre-trained language-vision models on eleven diverse language setups.
arXiv Detail & Related papers (2021-09-10T03:47:29Z)
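The entry above distills an English language-vision teacher into a multilingual, code-mixed student. Its specific objective is not described here, so the following is only a generic soft-label distillation loss of the kind such setups commonly build on; the temperature and mixing weight are assumed values.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Generic knowledge-distillation objective (illustrative; hyperparameters are assumptions).

    Combines KL divergence between temperature-softened teacher and student
    distributions with the usual cross-entropy on gold labels.
    """
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


# Toy usage.
student = torch.randn(4, 100)
teacher = torch.randn(4, 100)
labels = torch.randint(0, 100, (4,))
print(distillation_loss(student, teacher, labels))
```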
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
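Visual Translation Language Modeling, as summarised above, pairs a caption with its machine translation and predicts masked tokens with the image as context. The exact recipe is not given in this summary, so the sketch below only illustrates a generic masked-token setup for such a caption/translation pair; the mask id, mask rate, and ignore index are assumptions.

```python
import random
from typing import List, Tuple

MASK_ID = 103        # assumed mask-token id (BERT-style); purely illustrative
IGNORE_INDEX = -100  # positions with this label are excluded from the loss


def mask_caption_pair(caption_ids: List[int],
                      translation_ids: List[int],
                      mask_prob: float = 0.15) -> Tuple[List[int], List[int]]:
    """Illustrative masked-token preparation over a caption and its translation.

    Returns (input_ids, labels): masked positions carry the original id in
    `labels` and MASK_ID in `input_ids`; unmasked positions are ignored.
    A VTLM-style model would additionally condition on the image's region
    features when predicting the masked tokens (not shown here).
    """
    input_ids = caption_ids + translation_ids  # new list; inputs stay untouched
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if random.random() < mask_prob:
            labels[i] = token_id
            input_ids[i] = MASK_ID
    return input_ids, labels


# Toy usage with made-up token ids.
inp, lab = mask_caption_pair([2023, 2003, 1037, 4937], [7592, 2088, 999])
print(inp, lab)
```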
- Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models [144.85290716246533]
We study zero-shot cross-lingual transfer of vision-language models.
We propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.
arXiv Detail & Related papers (2021-03-16T04:37:40Z)
- MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer [136.09386219006123]
We propose MAD-X, an adapter-based framework that enables high portability and parameter-efficient transfer to arbitrary tasks and languages.
MAD-X outperforms the state of the art in cross-lingual transfer across a representative set of typologically diverse languages on named entity recognition and causal commonsense reasoning.
arXiv Detail & Related papers (2020-04-30T18:54:43Z)
- Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR, the model proposed in this paper, is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
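Both the xGQA abstract above (a drop of around 38 accuracy points) and the XTREME entry quantify zero-shot cross-lingual transfer as the gap between source-language and target-language scores. The small helper below, using placeholder numbers rather than any paper's actual results, shows that bookkeeping.

```python
from typing import Dict


def transfer_drop(accuracy_by_lang: Dict[str, float], source: str = "en") -> Dict[str, float]:
    """Accuracy drop of each target language relative to the source language."""
    src_acc = accuracy_by_lang[source]
    return {lang: src_acc - acc
            for lang, acc in accuracy_by_lang.items()
            if lang != source}


# Placeholder values only -- not results from xGQA or XTREME.
scores = {"en": 60.0, "de": 25.0, "zh": 20.0, "bn": 18.0}
print(transfer_drop(scores))  # {'de': 35.0, 'zh': 40.0, 'bn': 42.0}
```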