LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
- URL: http://arxiv.org/abs/2408.13909v1
- Date: Sun, 25 Aug 2024 18:10:16 GMT
- Title: LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task
- Authors: Ali Asgarov, Samir Rustamov
- Abstract summary: This research explores the development of vision-language models for image retrieval in low-resource languages, specifically Azerbaijani.
We adapted the CLIP model architecture and employed several techniques to balance computational efficiency with performance.
Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This research explores the development of multimodal vision-language models for image retrieval in low-resource languages, specifically Azerbaijani. Existing vision-language models primarily support high-resource languages, and fine-tuning them remains computationally demanding. To address challenges in vision-language retrieval for low-resource languages, we integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance. These techniques include synthetic data generation through machine translation, image augmentation, and further training the attention mechanisms of transformer-based models with domain-specific data. We integrated Multilingual BERT as a text encoder with image encoders like ResNet50, EfficientNet0, Vision Transformer (ViT), and Tiny Swin Transformer. Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on, such as COCO, Flickr30k, and Flickr8k. Augmentation techniques boosted EfficientNet0 MAP on Flickr30k from 0.84 to 0.87 and ResNet50 MAP on MSCOCO from 0.70 to 0.80, contributing to a new state of the art in vision-language retrieval. We share our configurations and results to support further research. Code and pre-trained models are available at https://github.com/aliasgerovs/azclip.
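As a concrete reference, below is a minimal sketch of the kind of dual-encoder setup the abstract describes: a Multilingual BERT text tower paired with a ResNet50 image tower, trained with CLIP's symmetric contrastive loss. The projection size, temperature, pooling choice, and checkpoint names are illustrative assumptions and are not taken from the paper or its repository.

```python
# Minimal CLIP-style dual encoder: Multilingual BERT text tower + ResNet50 image tower.
# Hyperparameters (embed_dim, temperature, pooling) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from transformers import AutoModel


class DualEncoder(nn.Module):
    def __init__(self, embed_dim: int = 256, temperature: float = 0.07):
        super().__init__()
        # Image tower: ResNet50 backbone with its classification head removed.
        backbone = resnet50(weights=None)
        backbone.fc = nn.Identity()          # expose the 2048-d pooled features
        self.image_encoder = backbone
        self.image_proj = nn.Linear(2048, embed_dim)

        # Text tower: Multilingual BERT, pooled via the [CLS] token.
        self.text_encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)

        # Learnable temperature, stored in log space as in CLIP.
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / temperature).log())

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.image_proj(self.image_encoder(pixel_values))
        txt = self.text_proj(
            self.text_encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state[:, 0]
        )
        img = F.normalize(img, dim=-1)
        txt = F.normalize(txt, dim=-1)

        # Symmetric in-batch contrastive (InfoNCE) loss over the similarity matrix.
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```

The image tower is the swappable component: the paper's other encoders (EfficientNet0, ViT, Tiny Swin Transformer) can replace ResNet50, provided the projection layer's input dimension matches the backbone's feature size.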
Related papers
- From Unimodal to Multimodal: Scaling up Projectors to Align Modalities [16.733970553781887]
We propose a novel approach that aligns vision and language modalities using only projection layers on pretrained, frozen unimodal encoders.
Our method exploits the high semantic similarity between embedding spaces of well-trained vision and language models.
It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple projectors.
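A minimal sketch of this projector-only idea, assuming both pretrained encoders stay frozen and only two linear heads are trained on their precomputed embeddings; the dimensions and fixed temperature are illustrative, not the paper's configuration.

```python
# Sketch of projector-only alignment over frozen unimodal encoders (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectorAligner(nn.Module):
    """Trains only two small projection heads on top of frozen encoder outputs."""

    def __init__(self, vision_dim: int, text_dim: int, shared_dim: int = 512):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, shared_dim)
        self.text_proj = nn.Linear(text_dim, shared_dim)

    def forward(self, vision_feats: torch.Tensor, text_feats: torch.Tensor):
        # vision_feats / text_feats are precomputed with frozen encoders (no grad).
        v = F.normalize(self.vision_proj(vision_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = 100.0 * v @ t.t()  # fixed temperature for simplicity
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2
```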
arXiv Detail & Related papers (2024-09-28T17:57:32Z)
- RWKV-CLIP: A Robust Vision-Language Representation Learner [31.501759213619646]
Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks.
We introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags.
We propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs.
arXiv Detail & Related papers (2024-06-11T06:10:46Z)
- M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining [26.262677587795242]
We introduce a comprehensive bilingual (Chinese-English) dataset BM-6B with over 6 billion image-text pairs.
To handle such a scale of dataset, we propose a novel grouped aggregation approach for image-text contrastive loss computation.
We pretrain a series of bilingual image-text foundation models with an enhanced fine-grained understanding ability on BM-6B.
arXiv Detail & Related papers (2024-01-29T05:43:33Z)
- CAPIVARA: Cost-Efficient Approach for Improving Multilingual CLIP Performance on Low-Resource Languages [3.760470440988674]
CAPIVARA is a cost-efficient framework designed to enhance the performance of multilingual CLIP models in low-resource languages.
We use image captioning and machine translation to generate synthetic captions in low-resource languages.
We show the potential for significant improvements in other low-resource languages, achieved by fine-tuning the pre-trained multilingual CLIP on a single GPU for 2 hours.
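A rough sketch of this caption-then-translate pipeline using off-the-shelf Hugging Face pipelines; the BLIP and NLLB checkpoints and the Azerbaijani target code are illustrative choices, not necessarily what CAPIVARA (or LowCLIP) used.

```python
# Generate English captions for unlabeled images, then machine-translate them
# into a low-resource language. Checkpoints below are common public models
# chosen for illustration only.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M",
                      src_lang="eng_Latn", tgt_lang="azj_Latn")  # North Azerbaijani


def synthesize_caption(image_path: str) -> str:
    english = captioner(image_path)[0]["generated_text"]
    return translator(english)[0]["translation_text"]
```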
arXiv Detail & Related papers (2023-10-20T17:44:25Z)
- NLLB-CLIP -- train performant multilingual image retrieval model on a budget [65.268245109828]
We present NLLB-CLIP, a CLIP model with a text encoder from the NLLB model.
We used an automatically created dataset of 106,246 good-quality images with captions in 201 languages.
We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
arXiv Detail & Related papers (2023-09-04T23:26:11Z)
- Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations [53.89380284760555]
We introduce Babel-ImageNet, a massively multilingual benchmark that offers partial translations of ImageNet labels to 100 languages.
We evaluate 11 public multilingual CLIP models on our benchmark, demonstrating a significant gap between English ImageNet performance and that of high-resource languages.
We show that the performance of multilingual CLIP can be drastically improved for low-resource languages with parameter-efficient language-specific training.
arXiv Detail & Related papers (2023-06-14T17:53:06Z)
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP achieves state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- GIT: A Generative Image-to-text Transformer for Vision and Language [138.91581326369837]
We train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering.
Our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 CIDEr).
arXiv Detail & Related papers (2022-05-27T17:03:38Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)