CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification
- URL: http://arxiv.org/abs/2204.14244v1
- Date: Fri, 29 Apr 2022 17:17:24 GMT
- Title: CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification
- Authors: Marcos V. Conde, Kerem Turgutlu
- Abstract summary: We present one of the first methods to use CLIP (Contrastive Language-Image Pre-Training) to train a neural network on a variety of paired artwork images and text descriptions.
Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition.
On this benchmark we achieve competitive results using only self-supervision.
- Score: 7.6146285961466
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Existing computer vision research on artwork struggles with fine-grained
attribute recognition and with the lack of curated annotated datasets, which are
costly to create. To the best of our knowledge, ours is one of the first methods
to use CLIP (Contrastive Language-Image Pre-Training) to train a neural network
on a variety of paired artwork images and text descriptions. CLIP can learn
directly from free-form art descriptions or, where available, from curated
fine-grained labels. The model's zero-shot capability allows it to predict an
accurate natural language description for a given image without directly
optimizing for the task. Our approach aims to solve two challenges: instance
retrieval and fine-grained artwork attribute recognition. We use the iMet
Dataset, which we consider the largest annotated artwork dataset. On this
benchmark we achieve competitive results using only self-supervision.
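As a rough illustration of the zero-shot setup described above (a sketch assuming the open-source OpenAI CLIP package, not the authors' released code), candidate attribute descriptions can be scored against an artwork image as follows; the image path and candidate labels are placeholders:

```python
# Minimal sketch of zero-shot attribute prediction with CLIP.
# Assumes the open-source `clip` package (https://github.com/openai/CLIP)
# and PyTorch; the image path and candidate labels are illustrative only.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Candidate fine-grained attribute descriptions (hypothetical examples).
candidates = [
    "a bronze sculpture from ancient Greece",
    "a woodblock print from Edo-period Japan",
    "an oil painting of a Dutch still life",
]

image = preprocess(Image.open("artwork.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(candidates).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Cosine similarity between the image and each description,
    # turned into a probability distribution over candidates.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

best = candidates[probs.argmax().item()]
print(f"Predicted attribute description: {best}")
```

Instance retrieval can be sketched with the same ingredients: embed a query image and a gallery of images with encode_image, normalize the embeddings, and rank the gallery by cosine similarity to the query.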
Related papers
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
arXiv Detail & Related papers (2024-06-11T17:59:35Z)
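A minimal sketch of the two-part objective described in the entry above, assuming precomputed embeddings and decoder logits (shapes, module names, and the equal loss weighting are illustrative assumptions, not details from the paper):

```python
# Rough sketch of an LCL-style objective for interleaved image-text data:
# (1) contrastive alignment between visual features and the preceding text
#     context, and (2) autoregressive generation of the subsequent text.
# Shapes and the 0.5/0.5 weighting are assumptions for illustration.
import torch
import torch.nn.functional as F

def lcl_style_loss(visual_emb, context_emb, next_token_logits, next_tokens,
                   temperature=0.07):
    # visual_emb:        (B, D)  image representations
    # context_emb:       (B, D)  embeddings of the text preceding each image
    # next_token_logits: (B, T, V) decoder logits for the text after each image
    # next_tokens:       (B, T)  target token ids for that text

    # 1) InfoNCE-style contrastive loss between image and preceding context.
    v = F.normalize(visual_emb, dim=-1)
    c = F.normalize(context_emb, dim=-1)
    logits = v @ c.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    contrastive = (F.cross_entropy(logits, targets) +
                   F.cross_entropy(logits.t(), targets)) / 2

    # 2) Token-level cross-entropy for generating the subsequent text.
    generative = F.cross_entropy(
        next_token_logits.reshape(-1, next_token_logits.size(-1)),
        next_tokens.reshape(-1))

    return 0.5 * contrastive + 0.5 * generative

# Dummy example with random tensors.
B, D, T, V = 4, 512, 16, 1000
loss = lcl_style_loss(torch.randn(B, D), torch.randn(B, D),
                      torch.randn(B, T, V), torch.randint(0, V, (B, T)))
print(loss.item())
```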
- Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval [29.419743866789187]
We perform exhaustive experiments on the NoisyArt dataset, a collection of artwork images crawled from public resources on the web.
On this dataset, CLIP achieves impressive results on (zero-shot) classification and promising results in both the artwork-to-artwork and description-to-artwork settings.
arXiv Detail & Related papers (2023-09-21T14:29:44Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle this task.
We train a Combiner network that learns to combine the image and text features, integrating the bimodal information.
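A minimal sketch of such a combiner, assuming precomputed CLIP features and an invented two-layer fusion network rather than the paper's actual Combiner architecture:

```python
# Toy combiner for composed image retrieval: fuse CLIP features of a
# reference image and a relative caption, then rank candidate images by
# cosine similarity to the fused query. The architecture is an assumption
# for illustration, not the paper's Combiner.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCombiner(nn.Module):
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, image_feat, text_feat):
        # image_feat, text_feat: (B, dim) CLIP features of the reference
        # image and the relative caption.
        fused = self.fuse(torch.cat([image_feat, text_feat], dim=-1))
        return F.normalize(fused, dim=-1)

# Dummy retrieval: rank 100 candidate images against one composed query.
combiner = ToyCombiner()
query = combiner(torch.randn(1, 512), torch.randn(1, 512))      # (1, 512)
gallery = F.normalize(torch.randn(100, 512), dim=-1)            # (100, 512)
ranking = (query @ gallery.t()).argsort(dim=-1, descending=True)
print(ranking[0, :5])  # indices of the top-5 candidates
```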
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- Self-Supervised Image Captioning with CLIP [0.0]
We introduce a self-supervised image captioning method.
After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data.
Despite using less than 2% of the labeled COCO dataset, our method delivers performance comparable to state-of-the-art models trained on the complete dataset.
arXiv Detail & Related papers (2023-06-26T23:29:16Z)
- PLIP: Language-Image Pre-training for Person Representation Learning [51.348303233290025]
We propose a novel language-image pre-training framework for person representation learning, termed PLIP.
To implement our framework, we construct a large-scale person dataset with image-text pairs named SYNTH-PEDES.
PLIP not only significantly improves existing methods on all these tasks, but also shows great ability in the zero-shot and domain generalization settings.
arXiv Detail & Related papers (2023-05-15T06:49:00Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner.
Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
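One plausible way to realize such zero-shot quality ("look") assessment, assuming a simple pair of opposing prompts; the prompts and scoring rule are illustrative guesses, not the paper's exact design:

```python
# Zero-shot image quality scoring with CLIP: compare an image against a
# pair of opposing prompts and read the "good" probability as a quality
# score. Prompts and scoring rule are assumptions for illustration.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

prompts = clip.tokenize(["a good photo", "a bad photo"]).to(device)
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

quality_score = probs[0, 0].item()  # probability assigned to "a good photo"
print(f"Zero-shot quality score: {quality_score:.3f}")
```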
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting [91.56988987393483]
We present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP.
Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models.
Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones.
arXiv Detail & Related papers (2021-12-02T18:59:32Z)
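The pixel-text matching idea above can be illustrated with a short sketch: dot-product dense visual features with text embeddings of class names to obtain per-class score maps. Shapes and tensors below are placeholders, not DenseCLIP's implementation:

```python
# Sketch of pixel-text score maps: dot-product dense visual features with
# text embeddings of class names to get one score map per class. Shapes
# and the random features are placeholders, not DenseCLIP's actual code.
import torch
import torch.nn.functional as F

C, H, W, K = 512, 32, 32, 3          # feature dim, spatial size, #classes

pixel_features = torch.randn(1, C, H, W)   # dense visual features (B, C, H, W)
text_features = torch.randn(K, C)          # one embedding per class name

# Normalize both sides so the dot product is a cosine similarity.
pixel_features = F.normalize(pixel_features, dim=1)
text_features = F.normalize(text_features, dim=-1)

# (B, C, H, W) x (K, C) -> (B, K, H, W) pixel-text score maps, which can
# serve as coarse segmentation logits or as guidance for a decoder.
score_maps = torch.einsum("bchw,kc->bkhw", pixel_features, text_features)
print(score_maps.shape)  # torch.Size([1, 3, 32, 32])
```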