Embracing Language Inclusivity and Diversity in CLIP through Continual
Language Learning
- URL: http://arxiv.org/abs/2401.17186v1
- Date: Tue, 30 Jan 2024 17:14:05 GMT
- Title: Embracing Language Inclusivity and Diversity in CLIP through Continual
Language Learning
- Authors: Bang Yang, Yong Dai, Xuxin Cheng, Yaowei Li, Asif Raza, Yuexian Zou
- Abstract summary: Vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, but their mastery of only a few languages, such as English, restricts their applicability to broader communities.
We propose to extend VL-PTMs' language capacity via continual language learning (CLL), in which a model updates its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF).
We construct a CLL benchmark covering 36 languages based on the MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While vision-language pre-trained models (VL-PTMs) have advanced multimodal
research in recent years, their mastery of only a few languages, such as
English, restricts their applicability to broader communities. To this end,
there is increasing interest in developing multilingual VL models via a
joint-learning setup, which, however, can be impractical given the high
training costs and limited data availability. In this work, we propose to
extend VL-PTMs' language capacity via continual language learning (CLL), in
which a model updates its linguistic knowledge incrementally without suffering
from catastrophic forgetting (CF). We
begin our study by introducing a model dubbed CLL-CLIP, which builds upon CLIP,
a prevailing VL-PTM that has acquired image-English text alignment.
Specifically, CLL-CLIP contains an expandable token embedding layer to handle
linguistic differences. It solely trains token embeddings to improve memory
stability and is optimized under cross-modal and cross-lingual objectives to
learn the alignment between images and multilingual texts. To alleviate the CF
arising from covariate shift and lexical overlap, we further propose a novel
approach that ensures an identical distribution of all token embeddings at
initialization and regularizes token embedding learning during training. We
construct a CLL benchmark covering 36 languages based on the MSCOCO and XM3600
datasets and then evaluate multilingual image-text retrieval performance.
Extensive experiments verify the effectiveness of CLL-CLIP and show that our
approach can boost CLL-CLIP, e.g., by 6.7% in text-to-image average Recall@1 on
XM3600, and improve various state-of-the-art methods consistently. Our code and
data are available at https://github.com/yangbang18/CLFM.
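
To make the mechanism above concrete, here is a minimal PyTorch sketch of the two ideas the abstract names: an expandable token embedding layer that is the only trainable component, initialized so that new rows match the distribution of the existing ones, plus one plausible form of the training-time regularization. All names here are illustrative assumptions rather than the authors' API; see https://github.com/yangbang18/CLFM for the reference implementation.

```python
# Hypothetical sketch of CLL-CLIP's stated ingredients, assuming a CLIP-like
# backbone; not the authors' code.
import torch
import torch.nn as nn


class ExpandableTokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    @torch.no_grad()
    def expand(self, num_new_tokens: int) -> None:
        """Append rows for a new language, sampled from a Gaussian fitted to
        the current matrix so that old and new embeddings share
        (approximately) the same distribution at initialization."""
        old = self.embed.weight                        # (V, D)
        mean, std = old.mean(dim=0), old.std(dim=0)    # per-dimension stats
        new = torch.randn(num_new_tokens, old.size(1),
                          device=old.device, dtype=old.dtype) * std + mean
        self.embed = nn.Embedding.from_pretrained(
            torch.cat([old, new], dim=0), freeze=False
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(token_ids)


def freeze_all_but_embeddings(model: nn.Module,
                              embedding: ExpandableTokenEmbedding) -> None:
    """'Solely trains token embeddings': freeze the whole backbone, then
    re-enable gradients for the embedding layer only."""
    for p in model.parameters():
        p.requires_grad = False
    for p in embedding.parameters():
        p.requires_grad = True


def embedding_drift_penalty(embedding: ExpandableTokenEmbedding,
                            init_weight: torch.Tensor,
                            coeff: float = 1e-3) -> torch.Tensor:
    """One plausible regularizer for the 'regularizes token embedding
    learning' step: an L2 penalty on drift away from the (post-expansion)
    initialization. The paper's exact regularizer may differ."""
    return coeff * (embedding.embed.weight - init_weight).pow(2).sum()
```

A training loop would snapshot `init_weight` right after `expand()` and add this penalty to the cross-modal and cross-lingual contrastive objectives, which this sketch omits.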
Related papers
- Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer [5.355430735475281]
This paper investigates the complexities of multilingual prompt-based code generation.
Our evaluations reveal significant disparities in code quality for non-English prompts.
We propose a zero-shot cross-lingual approach using a neural projection technique.
arXiv Detail & Related papers (2024-08-19T05:11:46Z)
- Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction in the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z)
- A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene [11.265838907079196]
We propose a conceptually simple yet effective multilingual CLIP compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English contexts.
In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual vision-language feature distillation and alignment.
Comprehensive experiments in zero-shot image classification, conducted on the ELEVATER benchmark, show that DC-CLIP achieves superior performance in the English context.
arXiv Detail & Related papers (2024-04-17T10:56:06Z)
- mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs [50.17767479660832]
Vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition the LLMs to 'understand' the image input.
We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware.
arXiv Detail & Related papers (2023-07-13T17:51:58Z)
- Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment [31.885608173448368]
Pre-trained vision and language models such as CLIP have witnessed remarkable success in connecting images and texts with a primary focus on English texts.
However, disparities in performance among languages have been observed due to uneven resource availability.
We propose a new parameter-efficient cross-lingual transfer learning framework that utilizes a translation-based alignment method to mitigate multilingual disparities.
arXiv Detail & Related papers (2023-05-02T14:09:02Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- A Multi-level Supervised Contrastive Learning Framework for Low-Resource Natural Language Inference [54.678516076366506]
Natural Language Inference (NLI) is an increasingly important task in natural language understanding.
Here we propose a multi-level supervised contrastive learning framework named MultiSCL for low-resource natural language inference.
arXiv Detail & Related papers (2022-05-31T05:54:18Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- A Multilingual Modeling Method for Span-Extraction Reading Comprehension [2.4905424368103444]
We propose a multilingual extractive reading comprehension approach called XLRC.
We show that our model outperforms the state-of-the-art baseline (i.e., RoBERTa_Large) on the CMRC 2018 task.
arXiv Detail & Related papers (2021-05-31T11:05:30Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
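
As a closing note on the headline metric quoted in the abstract above (text-to-image average Recall@1 on XM3600), the following is a minimal, hedged sketch of how that retrieval metric is commonly computed. It assumes precomputed, unit-normalized features and one ground-truth image per caption; the benchmark's exact protocol (e.g., averaging over the 36 languages) may differ.

```python
# Hypothetical helper, not from the paper: standard text-to-image Recall@K.
import torch


def t2i_recall_at_k(text_feats: torch.Tensor,    # (T, D) caption features
                    image_feats: torch.Tensor,   # (I, D) image features
                    gt_image_idx: torch.Tensor,  # (T,) gold image per caption
                    k: int = 1) -> float:
    sims = text_feats @ image_feats.t()          # cosine sims if normalized
    topk = sims.topk(k, dim=1).indices           # k best images per caption
    hits = (topk == gt_image_idx.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```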