cViL: Cross-Lingual Training of Vision-Language Models using Knowledge
Distillation
- URL: http://arxiv.org/abs/2206.03354v2
- Date: Thu, 9 Jun 2022 05:40:02 GMT
- Title: cViL: Cross-Lingual Training of Vision-Language Models using Knowledge
Distillation
- Authors: Kshitij Gupta, Devansh Gautam, Radhika Mamidi
- Abstract summary: We propose a pipeline that utilizes English-only vision-language models to train a monolingual model for a target language.
We release a large-scale visual question answering dataset in Japanese and Hindi.
Our pipeline outperforms the current state-of-the-art models by a relative increase of 4.4% and 13.4% in accuracy, respectively.
- Score: 6.381149074212897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-language tasks are gaining popularity in the research community,
but the focus is still mainly on English. We propose a pipeline that utilizes
English-only vision-language models to train a monolingual model for a target
language. We propose to extend OSCAR+, a model which leverages object tags as
anchor points for learning image-text alignments, to train on visual question
answering datasets in different languages. We propose a novel approach to
knowledge distillation to train the model in other languages using parallel
sentences. Compared to other models that use the target language in the
pretraining corpora, we can leverage an existing English model to transfer the
knowledge to the target language using significantly fewer resources. We also
release a large-scale visual question answering dataset in Japanese and Hindi.
Though we restrict our work to visual question answering, our model
can be extended to any sequence-level classification task, and it can be
extended to other languages as well. This paper focuses on two languages for
the visual question answering task: Japanese and Hindi. Our pipeline
outperforms the current state-of-the-art models by a relative increase of 4.4%
and 13.4% in accuracy, respectively.
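
To make the distillation idea concrete, the following is a minimal sketch (not the paper's actual implementation; the temperature, loss weighting, and answer-vocabulary size are illustrative assumptions) of training a target-language student against an English teacher's answer distribution on parallel questions, alongside the hard VQA answer labels:

```python
# Minimal sketch of parallel-sentence knowledge distillation for VQA.
# Illustrative only: module names, temperature, and loss weighting are
# assumptions, not taken from the cViL paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft teacher targets with hard VQA answer labels."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy on the answer labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Usage: the English teacher sees the English question, the student sees the
# parallel Japanese/Hindi question; both condition on the same image features.
batch, num_answers = 8, 3129  # answer-vocabulary size is a placeholder
with torch.no_grad():
    teacher_logits = torch.randn(batch, num_answers)   # teacher(image, en_question)
student_logits = torch.randn(batch, num_answers, requires_grad=True)  # student(image, target_question)
labels = torch.randint(0, num_answers, (batch,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```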
Related papers
- LEIA: Facilitating Cross-lingual Knowledge Transfer in Language Models with Entity-based Data Augmentation [21.980770995466134]
We introduce LEIA, a language adaptation tuning method that utilizes Wikipedia entity names aligned across languages.
This method involves augmenting the target language corpus with English entity names and training the model using left-to-right language modeling.
arXiv Detail & Related papers (2024-02-18T07:24:34Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained on more data outperform monolingual ones, but, when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- Towards Developing a Multilingual and Code-Mixed Visual Question
Answering System by Knowledge Distillation [20.33235443471006]
We propose a knowledge distillation approach to extend an English language-vision model (teacher) into an equally effective multilingual and code-mixed model (student).
We also create a large-scale multilingual and code-mixed VQA dataset in eleven different language setups.
Experimental results and in-depth analysis show the effectiveness of the proposed VQA model over the pre-trained language-vision models on eleven diverse language setups.
arXiv Detail & Related papers (2021-09-10T03:47:29Z)
- Cross-lingual Emotion Detection [6.767035411834297]
We consider English as the source language with Arabic and Spanish as target languages.
Our BERT-based monolingual models that are trained on target language data surpass the state-of-the-art (SOTA) by 4% and 5% absolute Jaccard score for Arabic and Spanish, respectively.
Next, we show that using cross-lingual approaches with English data alone, we can achieve more than 90% and 80% relative effectiveness of the Arabic and Spanish BERT models, respectively.
arXiv Detail & Related papers (2021-06-10T19:52:06Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Read Like Humans: Autonomous, Bidirectional and Iterative Language
Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose ABINet, an autonomous, bidirectional and iterative model for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
- Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Vokenization: Improving Language Understanding with Contextualized,
Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language
Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)
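
As a minimal sketch of the contrastive pre-training idea summarized above for InfoXLM (the sentence-embedding pooling and temperature are assumptions, not taken from the paper), parallel sentence pairs can be scored against in-batch negatives with a symmetric InfoNCE loss:

```python
# Rough sketch of a cross-lingual contrastive (InfoNCE) objective over
# parallel sentences; encoder, pooling, and temperature are illustrative
# assumptions rather than InfoXLM's actual implementation.
import torch
import torch.nn.functional as F

def cross_lingual_contrastive_loss(src_embeds, tgt_embeds, temperature=0.05):
    """src_embeds[i] and tgt_embeds[i] encode a parallel sentence pair;
    all other pairs in the batch act as negatives."""
    src = F.normalize(src_embeds, dim=-1)
    tgt = F.normalize(tgt_embeds, dim=-1)
    logits = src @ tgt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(src.size(0))           # matching pair sits on the diagonal
    # Symmetric InfoNCE: source-to-target and target-to-source directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with dummy sentence embeddings (e.g. pooled encoder outputs).
src = torch.randn(16, 768)   # English sentences
tgt = torch.randn(16, 768)   # parallel target-language sentences
loss = cross_lingual_contrastive_loss(src, tgt)
```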