Delving Deeper into Cross-lingual Visual Question Answering
- URL: http://arxiv.org/abs/2202.07630v2
- Date: Thu, 8 Jun 2023 18:33:28 GMT
- Title: Delving Deeper into Cross-lingual Visual Question Answering
- Authors: Chen Liu, Jonas Pfeiffer, Anna Korhonen, Ivan Vulić, Iryna Gurevych
- Abstract summary: We show that simple modifications to the standard training setup can substantially reduce the transfer gap to monolingual English performance.
We analyze cross-lingual VQA across different question types of varying complexity for different multilingual multimodal Transformers.
- Score: 115.16614806717341
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Visual question answering (VQA) is one of the crucial vision-and-language
tasks. Yet, existing VQA research has mostly focused on the English language,
due to a lack of suitable evaluation resources. Previous work on cross-lingual
VQA has reported poor zero-shot transfer performance of current multilingual
multimodal Transformers with large gaps to monolingual performance, without any
deeper analysis. In this work, we delve deeper into the different aspects of
cross-lingual VQA, aiming to understand the impact of 1) modeling methods and
choices, including architecture, inductive bias, and fine-tuning; and 2) learning
biases, including question types and modality biases in cross-lingual setups.
The key results of our analysis are: 1) We show that simple modifications to
the standard training setup can substantially reduce the transfer gap to
monolingual English performance, yielding +10 accuracy points over existing
methods. 2) We analyze cross-lingual VQA across different question types of
varying complexity for different multilingual multimodal Transformers, and
identify question types that are the most difficult to improve on. 3) We
provide an analysis of modality biases present in training data and models,
revealing why zero-shot performance gaps remain for certain question types and
languages.
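The transfer gap referred to in 1) is simply the difference between monolingual English accuracy and zero-shot accuracy in a target language, and the breakdowns in 2) and 3) slice the same scores by question type. Below is a minimal sketch of such an analysis; the prediction records, language codes, and question-type labels are illustrative placeholders rather than the paper's actual evaluation code.

```python
# Minimal sketch of a transfer-gap analysis for cross-lingual VQA.
# The records below are placeholders; in practice they would come from a
# multilingual multimodal Transformer fine-tuned on English VQA data and
# evaluated zero-shot on target-language test questions (as in xGQA).
from collections import defaultdict

# Each record: (language, question_type, predicted_answer, gold_answer)
predictions = [
    ("en", "yes/no",   "yes", "yes"),
    ("en", "counting", "2",   "3"),
    ("de", "yes/no",   "no",  "yes"),
    ("de", "counting", "2",   "2"),
    ("zh", "yes/no",   "yes", "yes"),
    ("zh", "counting", "1",   "3"),
]

def accuracy_by(records, key_fn):
    """Group records by key_fn(language, question_type) and return exact-match accuracy per group."""
    correct, total = defaultdict(int), defaultdict(int)
    for lang, qtype, pred, gold in records:
        key = key_fn(lang, qtype)
        total[key] += 1
        correct[key] += int(pred == gold)
    return {key: correct[key] / total[key] for key in total}

# Overall accuracy per language, and the zero-shot transfer gap to English.
per_lang = accuracy_by(predictions, lambda lang, qtype: lang)
gaps = {lang: per_lang["en"] - acc for lang, acc in per_lang.items() if lang != "en"}
print("accuracy per language:", per_lang)
print("transfer gap vs. English:", gaps)

# Per (language, question type), to spot the question types hardest to improve on.
per_type = accuracy_by(predictions, lambda lang, qtype: (lang, qtype))
print("accuracy per (language, question type):", per_type)
```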
Related papers
- Bridging the Language Gap: Knowledge Injected Multilingual Question
Answering [19.768708263635176]
We propose a generalized cross-lingual transfer framework to enhance the model's ability to understand different languages.
Experimental results on the real-world MLQA dataset demonstrate that the proposed method improves performance by a large margin.
arXiv Detail & Related papers (2023-04-06T15:41:25Z) - Improving the Cross-Lingual Generalisation in Visual Question Answering [40.86774711775718]
Multilingual vision-language pretrained models show poor cross-lingual generalisation when applied to non-English data.
In this work, we explore the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task.
We improve cross-lingual transfer with three strategies: (1) we introduce a linguistic prior objective that augments the cross-entropy loss with a similarity-based loss to guide the model during training, (2) we learn a task-specific subnetwork that improves cross-lingual generalisation and reduces variance without modifying the model, and (3) we augment training examples using synthetic code-mixing (a toy sketch of code-mixing appears after this related-papers list).
arXiv Detail & Related papers (2022-09-07T08:07:43Z) - Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of
Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z) - xGQA: Cross-Lingual Visual Question Answering [100.35229218735938]
xGQA is a new multilingual evaluation benchmark for the visual question answering task.
We extend the established English GQA dataset to 7 typologically diverse languages.
We propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual.
arXiv Detail & Related papers (2021-09-13T15:58:21Z) - AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Multilingual Answer Sentence Reranking via Automatically Translated Data [97.98885151955467]
We present a study on the design of multilingual Answer Sentence Selection (AS2) models, which are a core component of modern Question Answering (QA) systems.
The main idea is to transfer data created in a resource-rich language, e.g., English, to other languages with fewer resources.
arXiv Detail & Related papers (2021-02-20T03:52:08Z) - Learning from Lexical Perturbations for Consistent Visual Question
Answering [78.21912474223926]
Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations.
We propose a novel approach based on modular networks to address this issue, which creates pairs of questions related by linguistic perturbations.
We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline to create controllable linguistic variations.
arXiv Detail & Related papers (2020-11-26T17:38:03Z)
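On the code-mixing strategy mentioned in the entry "Improving the Cross-Lingual Generalisation in Visual Question Answering" above: a toy sketch of synthetic code-mixing is given below, under the assumption that it replaces English question words with target-language translations during fine-tuning. The tiny EN_DE lexicon, the mixing rate, and the function name are illustrative choices, not taken from that paper.

```python
import random

# Tiny illustrative English->German lexicon; a real setup would use a full
# bilingual dictionary or word-aligned translations.
EN_DE = {"what": "was", "color": "farbe", "is": "ist", "the": "die", "dog": "hund"}

def code_mix(question, lexicon, rate=0.5, rng=None):
    """Replace each word found in the lexicon with its translation, with probability `rate`."""
    rng = rng or random.Random(0)
    words = question.lower().split()
    mixed = [lexicon[w] if w in lexicon and rng.random() < rate else w for w in words]
    return " ".join(mixed)

print(code_mix("What color is the dog", EN_DE))
# e.g. "what color ist die dog"; the exact mix depends on the RNG draws.
```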