Exploiting CLIP-based Multi-modal Approach for Artwork Classification
and Retrieval
- URL: http://arxiv.org/abs/2309.12110v1
- Date: Thu, 21 Sep 2023 14:29:44 GMT
- Title: Exploiting CLIP-based Multi-modal Approach for Artwork Classification
and Retrieval
- Authors: Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del
Bimbo
- Abstract summary: We perform exhaustive experiments on the NoisyArt dataset, a dataset of artwork images crawled from public resources on the web.
On this dataset CLIP achieves impressive results on (zero-shot) classification and promising results in both artwork-to-artwork and description-to-artwork retrieval.
- Score: 29.419743866789187
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Given the recent advances in multimodal image pretraining, where
visual models trained with semantically dense textual supervision tend to have
better generalization capabilities than those trained using categorical
attributes or through unsupervised techniques, in this work we investigate how
the recent CLIP model can be applied to several tasks in the artwork domain.
We perform exhaustive experiments on the NoisyArt dataset, a dataset of
artwork images crawled from public resources on the web. On this dataset CLIP
achieves impressive results on (zero-shot) classification and promising
results in both artwork-to-artwork and description-to-artwork retrieval.
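For illustration, below is a minimal sketch of how CLIP supports zero-shot artwork classification via cosine similarity in a shared embedding space. It uses the openai/CLIP package; the class names, prompt template, and image path are placeholders, not the paper's actual setup.
```python
# Minimal sketch: zero-shot artwork classification with CLIP.
# Assumes the openai/CLIP package; class names and paths are placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical artwork classes; NoisyArt provides its own label set.
class_names = ["Mona Lisa", "The Starry Night", "Girl with a Pearl Earring"]
prompts = clip.tokenize([f"a photo of {c}" for c in class_names]).to(device)

image = preprocess(Image.open("artwork.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity in the shared embedding space drives both
    # zero-shot classification and description-to-artwork retrieval.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T
    probs = logits.softmax(dim=-1)

print(class_names[probs.argmax().item()])
```
The same normalized features support artwork-to-artwork retrieval by ranking gallery image embeddings against a query image embedding instead of against text embeddings.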
Related papers
- Masked Image Modeling: A Survey [73.21154550957898]
Masked image modeling has emerged as a powerful self-supervised learning technique in computer vision.
We construct a taxonomy and review the most prominent papers in recent years.
We aggregate the performance results of various masked image modeling methods on the most popular datasets.
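As context for the survey above, a minimal sketch of the masked-image-modeling objective (MAE-style, heavily simplified; the shapes and the 75% mask ratio are illustrative assumptions):
```python
# Minimal sketch of the masked-image-modeling objective: hide random
# patches and train a model to reconstruct them (MAE-style, simplified).
import torch
import torch.nn as nn

patches = torch.randn(8, 196, 768)           # (batch, num_patches, dim)
mask = torch.rand(8, 196) < 0.75             # mask ~75% of patches
encoder = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = nn.Linear(768, 768)                # stand-in for a real decoder

visible = patches.masked_fill(mask.unsqueeze(-1), 0.0)
reconstruction = decoder(encoder(visible))
loss = ((reconstruction - patches) ** 2)[mask].mean()  # loss on masked patches only
```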
arXiv Detail & Related papers (2024-08-13T07:27:02Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy to bolster image classification performance is to augment the training set with synthetic images generated by text-to-image (T2I) models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
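A hedged sketch of the inter-class translation idea behind Diff-Mix, using the diffusers img2img pipeline; this is not the authors' released code, and the model id, prompt template, and strength value are illustrative assumptions:
```python
# Minimal sketch of inter-class image translation for augmentation,
# using the diffusers img2img pipeline; the model id, target-class
# prompt, and strength are illustrative assumptions.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

source = Image.open("class_a_sample.jpg").convert("RGB").resize((512, 512))
# Translate a source-class image toward a different target class.
augmented = pipe(
    prompt="a photo of a <class B>",  # hypothetical target-class prompt
    image=source,
    strength=0.6,                     # how far to move from the source image
).images[0]
augmented.save("mixed_a_to_b.jpg")
```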
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study the transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one while integrating the modifications expressed by the caption.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
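A minimal sketch of a Combiner-style fusion module; the layer sizes and gating design are assumptions for illustration, not the authors' exact architecture:
```python
# Minimal sketch of a Combiner-style network that fuses CLIP image and
# text features for composed image retrieval; sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )
        # Learned gate balancing the fused feature against a simple sum.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor):
        pair = torch.cat([img_feat, txt_feat], dim=-1)
        mixed = self.gate(pair) * self.fuse(pair) + img_feat + txt_feat
        return F.normalize(mixed, dim=-1)  # unit norm for cosine retrieval

combiner = Combiner()
query = combiner(torch.randn(4, 512), torch.randn(4, 512))  # (4, 512)
```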
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- Domain Generalization for Mammographic Image Analysis with Contrastive Learning [62.25104935889111]
Training an efficacious deep learning model requires large amounts of data with diverse styles and qualities.
A novel contrastive learning scheme is developed to equip deep learning models with better style generalization capability.
The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets.
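A minimal sketch of a style-contrastive objective in this spirit, treating two style-augmented views of the same image as positives (a simplified NT-Xent loss; embedding sizes are placeholders):
```python
# Minimal sketch of a style-contrastive objective: two style-augmented
# views of the same image are positives (simplified NT-Xent loss).
import torch
import torch.nn.functional as F

def ntxent(z1, z2, tau: float = 0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / tau           # (batch, batch) similarity matrix
    labels = torch.arange(z1.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

z_view1 = torch.randn(16, 128)  # embeddings of vendor-style view 1
z_view2 = torch.randn(16, 128)  # embeddings of vendor-style view 2
loss = ntxent(z_view1, z_view2)
```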
arXiv Detail & Related papers (2023-04-20T11:40:21Z)
- StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
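A hedged sketch of the style-conditioned prompt idea: channel-wise statistics of visual features are projected into prompt tokens (projection shapes and token counts are assumptions, not StyLIP's exact design):
```python
# Minimal sketch of style-conditioned prompt learning: style statistics
# from vision-encoder features are projected into prompt tokens.
import torch
import torch.nn as nn

class StylePrompt(nn.Module):
    def __init__(self, style_dim: int = 768, token_dim: int = 512, n_tokens: int = 4):
        super().__init__()
        self.proj = nn.Linear(2 * style_dim, n_tokens * token_dim)
        self.n_tokens, self.token_dim = n_tokens, token_dim

    def forward(self, feats: torch.Tensor):
        # Style = channel-wise mean and std of intermediate visual features.
        style = torch.cat([feats.mean(dim=1), feats.std(dim=1)], dim=-1)
        return self.proj(style).view(-1, self.n_tokens, self.token_dim)

prompts = StylePrompt()(torch.randn(8, 196, 768))  # (8, 4, 512) prompt tokens
```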
arXiv Detail & Related papers (2023-02-18T07:36:16Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
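A minimal sketch of one way to pose a non-contrastive language-image objective, with each modality predicting the other's assignment over shared prototypes; this is an illustrative stand-in, not the paper's exact formulation:
```python
# Minimal sketch of a non-contrastive language-image objective: each
# modality predicts the other's assignment over shared prototypes
# (an illustrative stand-in for the nCLIP term).
import torch
import torch.nn.functional as F

prototypes = F.normalize(torch.randn(1024, 512), dim=-1)  # shared codes
img = F.normalize(torch.randn(32, 512), dim=-1)
txt = F.normalize(torch.randn(32, 512), dim=-1)

img_assign = F.softmax(img @ prototypes.T / 0.05, dim=-1)
txt_logits = txt @ prototypes.T / 0.05
# Text predicts the image's prototype distribution (and vice versa);
# xCLIP would add the standard CLIP contrastive term on top.
loss = F.cross_entropy(txt_logits, img_assign)
```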
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification [7.6146285961466]
Ours is one of the first methods to use CLIP (Contrastive Language-Image Pre-Training) to train a neural network on a variety of artwork image and text description pairs.
Our approach aims to solve two challenges: instance retrieval and fine-grained artwork attribute recognition.
In this benchmark we achieved competitive results using only self-supervision.
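A minimal sketch of contrastively fine-tuning CLIP on artwork image/description pairs, in the spirit of CLIP-Art; the optimizer settings and training-step structure are assumptions:
```python
# Minimal sketch of contrastively fine-tuning CLIP on artwork
# image/description pairs; optimizer and loop details are assumptions.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

def step(images, texts):  # preprocessed image batch, tokenized text batch
    img_f = F.normalize(model.encode_image(images), dim=-1)
    txt_f = F.normalize(model.encode_text(texts), dim=-1)
    logits = img_f @ txt_f.T * model.logit_scale.exp()
    labels = torch.arange(images.size(0), device=device)
    # Symmetric contrastive loss over both retrieval directions.
    loss = (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```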
arXiv Detail & Related papers (2022-04-29T17:17:24Z)
- Multimodal Contrastive Training for Visual Representation Learning [45.94662252627284]
We develop an approach to learning visual representations that embraces multimodal data.
Our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously.
By including multimodal training in a unified framework, our method can learn more powerful and generic visual features.
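A minimal sketch of combining intra-modal (two augmented image views) and cross-modal (image/caption) contrastive terms in one loss; dimensions and temperature are placeholders:
```python
# Minimal sketch of joint intra-modal and cross-modal contrastive
# training: augmented image views and paired captions share one loss.
import torch
import torch.nn.functional as F

def info_nce(a, b, tau: float = 0.1):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau
    return F.cross_entropy(logits, torch.arange(a.size(0)))

v1 = torch.randn(16, 256)  # image view 1 embedding
v2 = torch.randn(16, 256)  # image view 2 embedding (different augmentation)
t = torch.randn(16, 256)   # paired caption embedding
loss = info_nce(v1, v2) + info_nce(v1, t) + info_nce(t, v1)
```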
arXiv Detail & Related papers (2021-04-26T19:23:36Z)
- Region Comparison Network for Interpretable Few-shot Image Classification [97.97902360117368]
Few-shot image classification aims to train models for new classes effectively using only a limited number of labeled examples.
We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works.
We also present a new way to generalize the interpretability from the level of tasks to categories.
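A minimal sketch of the region-comparison idea: query region features are scored against class prototypes, and the per-region similarity map doubles as an explanation (a simplified stand-in for RCN):
```python
# Minimal sketch of region-wise comparison for few-shot classification:
# query regions are matched against class prototypes, and the per-region
# similarity map shows which regions support which class.
import torch
import torch.nn.functional as F

query_regions = F.normalize(torch.randn(49, 256), dim=-1)  # 7x7 regions
class_proto = F.normalize(torch.randn(5, 256), dim=-1)     # 5-way prototypes

similarity = query_regions @ class_proto.T         # (49, 5) region evidence
score = similarity.mean(dim=0)                     # aggregate to class scores
explanation = similarity.argmax(dim=1).view(7, 7)  # per-region best class
print(score.argmax().item())
```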
arXiv Detail & Related papers (2020-09-08T07:29:05Z)
- Multiple instance learning on deep features for weakly supervised object detection with extreme domain shifts [1.9336815376402716]
Weakly supervised object detection (WSOD) using only image-level annotations has attracted growing attention over the past few years.
We show that a simple multiple instance approach applied on pre-trained deep features yields excellent performances on non-photographic datasets.
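A minimal sketch of multiple instance learning over frozen deep features: region proposals are instances and only the image-level label supervises training (proposal counts and the example label are placeholders):
```python
# Minimal sketch of multiple instance learning on pre-trained deep
# features for weakly supervised detection: region proposals are
# instances; the image-level label supervises the max-scoring instance.
import torch
import torch.nn as nn

n_proposals, feat_dim, n_classes = 100, 2048, 20
region_feats = torch.randn(n_proposals, feat_dim)  # from a frozen backbone
classifier = nn.Linear(feat_dim, n_classes)

instance_scores = classifier(region_feats)        # (100, 20)
image_scores, best = instance_scores.max(dim=0)   # max-pool over instances
target = torch.zeros(n_classes).scatter_(0, torch.tensor([3]), 1.0)
loss = nn.functional.binary_cross_entropy_with_logits(image_scores, target)
# `best` indexes the highest-scoring proposal per class, i.e. the detection.
```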
arXiv Detail & Related papers (2020-08-03T20:36:01Z)