Contrastive Language-Image Pre-training for the Italian Language
- URL: http://arxiv.org/abs/2108.08688v1
- Date: Thu, 19 Aug 2021 13:53:47 GMT
- Title: Contrastive Language-Image Pre-training for the Italian Language
- Authors: Federico Bianchi, Giuseppe Attanasio, Raphael Pisoni, Silvia Terragni,
Gabriele Sarti, Sri Lakshmi
- Abstract summary: We present the first CLIP model for the Italian Language (CLIP-Italian) trained on more than 1.4 million image-text pairs.
Results show that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image retrieval and zero-shot classification.
- Score: 4.804798944613199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CLIP (Contrastive Language-Image Pre-training) is a very recent multi-modal
model that jointly learns representations of images and texts. The model is
trained on a massive amount of English data and shows impressive performance on
zero-shot classification tasks. Training the same model on a different language
is not trivial, since data in other languages might not be sufficient and the model
needs high-quality translations of the texts to guarantee good performance.
In this paper, we present the first CLIP model for the Italian Language
(CLIP-Italian), trained on more than 1.4 million image-text pairs. Results show
that CLIP-Italian outperforms the multilingual CLIP model on the tasks of image
retrieval and zero-shot classification.
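For readers who want a concrete picture of the training objective, the following is a minimal sketch of the symmetric image-text contrastive (InfoNCE) loss that CLIP-style models optimize. The embedding dimension, temperature value, and toy batch are illustrative assumptions and do not reproduce the actual CLIP-Italian training setup.

```python
# Minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# The embedding dimension, temperature, and toy batch are illustrative
# assumptions, not the actual CLIP-Italian training configuration.
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of aligned image-text pairs."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # logits[i, j] = similarity between image i and caption j, scaled by 1/T.
    logits = image_emb @ text_emb.t() / temperature

    # The matching caption for image i sits at index i (the diagonal).
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    # Toy batch: 8 aligned pairs of 512-dimensional image and text embeddings.
    images = torch.randn(8, 512)
    captions = torch.randn(8, 512)
    print(clip_contrastive_loss(images, captions).item())
```

In the full model, these embeddings would come from the image and text encoders, and the temperature is typically a learned parameter rather than a fixed constant.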
Related papers
- CLEFT: Language-Image Contrastive Learning with Efficient Large Language Model and Prompt Fine-Tuning [4.004641316826348]
We introduce a novel language-image Contrastive Learning method with an Efficient large language model and prompt Fine-Tuning (CLEFT).
Our method demonstrates state-of-the-art performance on multiple chest X-ray and mammography datasets.
The proposed parameter-efficient framework can reduce the total trainable model size by 39% and reduce the trainable language model to only 4% compared with the current BERT encoder.
arXiv Detail & Related papers (2024-07-30T17:57:32Z)
- A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene [11.265838907079196]
We propose a conceptually simple yet effective multilingual CLIP Compression framework and train a lightweight multilingual vision-language model, called DC-CLIP, for both Chinese and English contexts.
In this framework, we collect high-quality Chinese and English text-image pairs and design two training stages, including multilingual vision-language feature distillation and alignment.
Comprehensive zero-shot image classification experiments conducted on the ELEVATER benchmark show that DC-CLIP achieves superior performance in the English context.
arXiv Detail & Related papers (2024-04-17T10:56:06Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM for image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations [53.89380284760555]
We introduce Babel-ImageNet, a massively multilingual benchmark that offers partial translations of ImageNet labels to 100 languages.
We evaluate 11 public multilingual CLIP models on our benchmark, demonstrating a significant gap between English ImageNet performance and that of high-resource languages.
We show that the performance of multilingual CLIP can be drastically improved for low-resource languages with parameter-efficient language-specific training (the zero-shot evaluation protocol behind such benchmarks is sketched after this list).
arXiv Detail & Related papers (2023-06-14T17:53:06Z)
- Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
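To illustrate the zero-shot classification setting referenced in the abstract and in benchmarks such as Babel-ImageNet, below is a minimal sketch of the standard CLIP zero-shot protocol: each class name is wrapped in a text prompt in the target language, and an image is assigned to the class whose prompt embedding is most similar. The Italian prompts are hypothetical and the encoders are random placeholder layers; a real evaluation would use the trained CLIP-Italian or multilingual CLIP encoders.

```python
# Minimal sketch of CLIP-style zero-shot classification with Italian prompts.
# The encoders here are random placeholder layers so the example runs on its
# own; a real evaluation would plug in the trained image and text towers.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

EMB_DIM = 512

# Hypothetical Italian prompts, one per candidate class.
CLASS_PROMPTS = [
    "una foto di un gatto",       # cat
    "una foto di un cane",        # dog
    "una foto di un'automobile",  # car
]

# Placeholder encoders standing in for the trained image and text towers.
image_encoder = torch.nn.Linear(2048, EMB_DIM)
text_encoder = torch.nn.Linear(768, EMB_DIM)


def zero_shot_classify(image_features: torch.Tensor,
                       prompt_features: torch.Tensor) -> int:
    """Return the index of the class prompt most similar to the image."""
    img = F.normalize(image_encoder(image_features), dim=-1)
    txt = F.normalize(text_encoder(prompt_features), dim=-1)
    similarities = img @ txt.t()  # cosine similarities, shape (1, n_classes)
    return int(similarities.argmax(dim=-1).item())


if __name__ == "__main__":
    fake_image = torch.randn(1, 2048)                    # stand-in visual features
    fake_prompts = torch.randn(len(CLASS_PROMPTS), 768)  # stand-in tokenized prompts
    prediction = zero_shot_classify(fake_image, fake_prompts)
    print("predicted class:", CLASS_PROMPTS[prediction])
```

Image retrieval works the same way in reverse: a single text query is embedded once and candidate images are ranked by cosine similarity to it.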
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.