CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP
Performance on Low-Resource Languages
- URL: http://arxiv.org/abs/2310.13683v2
- Date: Mon, 23 Oct 2023 17:06:07 GMT
- Title: CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP
Performance on Low-Resource Languages
- Authors: Gabriel Oliveira dos Santos, Diego A. B. Moreira, Alef Iury Ferreira,
Jhessica Silva, Luiz Pereira, Pedro Bueno, Thiago Sousa, Helena Maia, N\'adia
Da Silva, Esther Colombini, Helio Pedrini and Sandra Avila
- Abstract summary: CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
We use image captioning and machine translation to generate synthetic captions in low-resource languages.
We show the potential for significant improvements in other low-resource languages, achieved by fine-tuning the pre-trained multilingual CLIP on a single GPU for 2 hours.
- Score: 3.760470440988674
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This work introduces CAPIVARA, a cost-efficient framework designed to enhance
the performance of multilingual CLIP models in low-resource languages. While
CLIP has excelled in zero-shot vision-language tasks, the resource-intensive
nature of model training remains challenging. Many datasets lack linguistic
diversity, featuring solely English descriptions for images. CAPIVARA addresses
this by augmenting text data using image captioning and machine translation to
generate multiple synthetic captions in low-resource languages. We optimize the
training pipeline with LiT, LoRA, and gradient checkpointing to alleviate the
computational cost. Through extensive experiments, CAPIVARA emerges as state of
the art in zero-shot tasks involving images and Portuguese texts. We show the
potential for significant improvements in other low-resource languages,
achieved by fine-tuning the pre-trained multilingual CLIP using CAPIVARA on a
single GPU for 2 hours. Our model and code is available at
https://github.com/hiaac-nlp/CAPIVARA.
Related papers
- A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene [11.265838907079196]
We propose a conceptually simple yet effective multilingual CLIP Compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English context.
In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual vision-language feature distillation and alignment.
Comprehensive experiments in zero-shot image classification, conducted based on the ELEVATER benchmark, showcase that DC-CLIP achieves superior performance in the English context.
arXiv Detail & Related papers (2024-04-17T10:56:06Z) - Enhancing Multilingual Capabilities of Large Language Models through
Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora.
Their performance still lags behind in most languages compared to a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z) - Embracing Language Inclusivity and Diversity in CLIP through Continual
Language Learning [58.92843729869586]
Vision-language pre-trained models (VL-PTMs) have advanced multimodal research in recent years, but their mastery in a few languages like English restricts their applicability in broader communities.
We propose to extend VL-PTMs' language capacity by continual language learning (CLL), where a model needs to update its linguistic knowledge incrementally without suffering from catastrophic forgetting (CF)
We construct a CLL benchmark covering 36 languages based on MSCOCO and XM3600 datasets and then evaluate multilingual image-text retrieval performance.
arXiv Detail & Related papers (2024-01-30T17:14:05Z) - Generalizing Multimodal Pre-training into Multilingual via Language
Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a textbfMultitextbfLingual textbfAcquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Can Character-based Language Models Improve Downstream Task Performance
in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fined-tuned on a small treebank leads to performance close to those obtained with the same architecture pre-trained on large multilingual and monolingual models.
arXiv Detail & Related papers (2021-10-26T14:59:16Z) - From Masked Language Modeling to Translation: Non-English Auxiliary
Tasks Improve Zero-shot Spoken Language Understanding [24.149299722716155]
We introduce xSID, a new benchmark for cross-lingual Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect.
We propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer.
Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.
arXiv Detail & Related papers (2021-05-15T23:51:11Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-source languages.
We propose a novel augmentation approach named Language Branch Machine Reading (LBMRC)
LBMRC trains multiple machine reading comprehension (MRC) models proficient in individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.