NLLB-CLIP -- train performant multilingual image retrieval model on a budget
- URL: http://arxiv.org/abs/2309.01859v3
- Date: Wed, 1 Nov 2023 18:43:34 GMT
- Title: NLLB-CLIP -- train performant multilingual image retrieval model on a budget
- Authors: Alexander Visheratin
- Abstract summary: We present NLLB-CLIP, a CLIP model with a text encoder from the NLLB model.
We used an automatically created dataset of 106,246 good-quality images with captions in 201 languages.
We show that NLLB-CLIP is comparable in quality to state-of-the-art models and significantly outperforms them on low-resource languages.
- Score: 65.268245109828
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Today, the exponential rise of large models developed by academic and
industrial institutions with the help of massive computing resources raises the
question of whether someone without access to such resources can make a
valuable scientific contribution. To explore this, we tried to solve the
challenging task of multilingual image retrieval having a limited budget of
$1,000. As a result, we present NLLB-CLIP, a CLIP model with a text encoder from
the NLLB model. To train the model, we used an automatically created dataset of
106,246 good-quality images with captions in 201 languages derived from the
LAION COCO dataset. We trained multiple models using image and text encoders of
various sizes and kept different parts of the model frozen during the training.
We thoroughly analyzed the trained models using existing evaluation datasets
and newly created XTD200 and Flickr30k-200 datasets. We show that NLLB-CLIP is
comparable in quality to state-of-the-art models and significantly outperforms
them on low-resource languages.
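
The recipe described in the abstract is to reuse the encoder of the NLLB translation model as the text tower of a CLIP-style dual encoder and train it contrastively on multilingual captions. Below is a minimal sketch of that idea, assuming PyTorch and Hugging Face transformers; the checkpoint names, projection size, mean pooling, and the choice to freeze the image tower are illustrative assumptions, not the authors' exact training setup.

import torch
import torch.nn.functional as F
from transformers import AutoModel, CLIPVisionModelWithProjection

class NLLBCLIPSketch(torch.nn.Module):
    """Sketch: pair a frozen CLIP image tower with an NLLB text encoder."""
    def __init__(self, embed_dim=512):
        super().__init__()
        # Image tower: a standard CLIP vision encoder, frozen to save compute
        # (illustrative choice; other variants and freezing schemes are possible).
        self.image_encoder = CLIPVisionModelWithProjection.from_pretrained(
            "openai/clip-vit-base-patch32")
        for p in self.image_encoder.parameters():
            p.requires_grad = False
        # Text tower: only the encoder half of an NLLB translation model.
        self.text_encoder = AutoModel.from_pretrained(
            "facebook/nllb-200-distilled-600M").encoder
        self.text_proj = torch.nn.Linear(self.text_encoder.config.d_model, embed_dim)
        self.logit_scale = torch.nn.Parameter(torch.tensor(2.659))  # exp() ~ 1/0.07

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.image_encoder(pixel_values=pixel_values).image_embeds
        txt = self.text_encoder(input_ids=input_ids,
                                attention_mask=attention_mask).last_hidden_state
        # Mean-pool token states over non-padding positions, then project.
        mask = attention_mask.unsqueeze(-1).float()
        txt = self.text_proj((txt * mask).sum(1) / mask.sum(1))
        img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        labels = torch.arange(logits.size(0), device=logits.device)
        # Symmetric CLIP-style (InfoNCE) loss over in-batch image-text pairs.
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

Because the NLLB tokenizer prepends a language-code token, captions in any of the 201 supported languages pass through the same text tower without architectural changes.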
Related papers
- EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models [36.576853882830896]
We introduce EvolveDirector to train a text-to-image generation model comparable to advanced models using publicly available resources.
This framework interacts with advanced models through their public APIs to obtain text-image data pairs to train a base model.
We leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model.
arXiv Detail & Related papers (2024-10-09T17:52:28Z)
- LowCLIP: Adapting the CLIP Model Architecture for Low-Resource Languages in Multimodal Image Retrieval Task [0.0]
This research explores the development of vision-language models for image retrieval in low-resource languages, specifically Azerbaijani.
We integrated the CLIP model architecture and employed several techniques to balance computational efficiency with performance.
Our study found that models like EfficientNet0 and Tiny Swin Transformer perform best on the datasets they were trained on.
arXiv Detail & Related papers (2024-08-25T18:10:16Z)
- Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks like MMLU, and our fine-tuned chat models deliver a strong human preference rate on major evaluation platforms like AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z)
- Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants [65.47222691674074]
The Muffin framework employs pre-trained vision-language models to act as providers of visual signals.
The UniMM-Chat dataset explores the complementarities of datasets to generate 1.1M high-quality and diverse multimodal instructions.
arXiv Detail & Related papers (2023-10-01T12:35:18Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- DINOv2: Learning Robust Visual Features without Supervision [75.42921276202522]
This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
Most of the technical contributions aim at accelerating and stabilizing the training at scale.
In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature.
arXiv Detail & Related papers (2023-04-14T15:12:19Z)
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP can achieve the state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
- LAION-5B: An open large-scale dataset for training next generation image-text models [16.129935376579326]
We present LAION-5B, a dataset consisting of 5.85 billion CLIP-filtered image-text pairs, of which 2.32B contain English language.
We show successful replication and fine-tuning of foundational models like CLIP, GLIDE and Stable Diffusion using the dataset.
We also provide several nearest neighbor indices, an improved web-interface for dataset exploration and subset generation.
arXiv Detail & Related papers (2022-10-16T00:08:18Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- KNN-Diffusion: Image Generation via Large-Scale Retrieval [40.6656651653888]
Learning to adapt enables several new capabilities.
Fine-tuning trained models to new samples can be achieved by simply adding them to the table.
Our diffusion-based model trains on images only, by leveraging a joint Text-Image multi-modal metric.
arXiv Detail & Related papers (2022-04-06T14:13:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.