InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
- URL: http://arxiv.org/abs/2211.12760v1
- Date: Wed, 23 Nov 2022 08:09:50 GMT
- Title: InDiReCT: Language-Guided Zero-Shot Deep Metric Learning for Images
- Authors: Konstantin Kobs, Michael Steininger, Andreas Hotho
- Abstract summary: We argue that depending on the application, users of image retrieval systems have different and changing similarity notions.
We present Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting.
InDiReCT is a model for LanZ-DML on images that exclusively uses a few text prompts for training.
- Score: 4.544151613454639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Common Deep Metric Learning (DML) datasets specify only one notion of
similarity, e.g., two images in the Cars196 dataset are deemed similar if they
show the same car model. We argue that depending on the application, users of
image retrieval systems have different and changing similarity notions that
should be incorporated as easily as possible. Therefore, we present
Language-Guided Zero-Shot Deep Metric Learning (LanZ-DML) as a new DML setting
in which users control the properties that should be important for image
representations without training data by only using natural language. To this
end, we propose InDiReCT (Image representations using Dimensionality Reduction
on CLIP embedded Texts), a model for LanZ-DML on images that exclusively uses a
few text prompts for training. InDiReCT utilizes CLIP as a fixed feature
extractor for images and texts and transfers the variation in text prompt
embeddings to the image embedding space. Extensive experiments on five datasets
and overall thirteen similarity notions show that, despite not seeing any
images during training, InDiReCT performs better than strong baselines and
approaches the performance of fully-supervised models. An analysis reveals that
InDiReCT learns to focus on regions of the image that correlate with the
desired similarity notion, which makes it a fast-to-train and easy-to-use
method to create custom embedding spaces using only natural language.
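The core transfer step lends itself to a compact illustration. The following is a minimal sketch, not the authors' implementation: it uses OpenAI's clip package, substitutes PCA (torch.pca_lowrank) for the paper's learned dimensionality reduction on text prompt embeddings, and the prompt template, the example car models, and the target dimensionality d are illustrative assumptions.

```python
# Minimal sketch of the InDiReCT idea (hedged approximation, not the paper's code):
# extract the directions of variation among CLIP text-prompt embeddings and reuse
# that projection for CLIP image embeddings, so images are compared only along
# the similarity notion expressed in the prompts.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text prompts that vary only in the property of interest (here: car model).
car_models = ["Audi A4", "BMW M3", "Ford Mustang", "Tesla Model 3"]  # illustrative list
prompts = [f"a photo of a {m}" for m in car_models]

with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(prompts).to(device)).float()
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# "Training": capture the main directions of variation in the text embeddings.
# The paper learns this reduction; PCA serves as a simple stand-in here.
mean = text_emb.mean(dim=0, keepdim=True)
d = 2  # target dimensionality, chosen for illustration
_, _, v = torch.pca_lowrank(text_emb - mean, q=d)
components = v[:, :d]  # (clip_dim, d) projection basis

def embed_image(path: str) -> torch.Tensor:
    """Map an image into the low-dimensional space derived from the text prompts."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image).float()
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (img_emb - mean) @ components  # apply the text-derived transformation

# Retrieval would then rank gallery images by (cosine) similarity of these
# projected embeddings to a projected query embedding.
```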
Related papers
- Multilingual Vision-Language Pre-training for the Remote Sensing Domain [4.118895088882213]
Methods based on Contrastive Language-Image Pre-training (CLIP) are nowadays extensively used in support of vision-and-language tasks involving remote sensing data.
This work proposes a novel vision-and-language model for the remote sensing domain, exploring the fine-tuning of a multilingual CLIP model.
Our resulting model, which we named Remote Sensing Multilingual CLIP (RS-M-CLIP), obtains state-of-the-art results in a variety of vision-and-language tasks.
arXiv Detail & Related papers (2024-10-30T18:13:11Z)
- Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z)
- Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models [21.17975741743583]
It has recently been discovered that using a pre-trained vision-language model (VLM), e.g., CLIP, to align a whole query image with several finer text descriptions can significantly enhance zero-shot performance.
In this paper, we empirically find that the finer descriptions tend to align more effectively with local areas of the query image rather than the whole image.
arXiv Detail & Related papers (2024-06-05T04:08:41Z)
- Self-Supervised Pre-training with Symmetric Superimposition Modeling for Scene Text Recognition [43.61569815081384]
We propose Symmetric Superimposition Modeling to simultaneously capture local character features and linguistic information in text images.
At the pixel level, we reconstruct the original and inverted images to capture character shapes and texture-level linguistic context.
At the feature level, we reconstruct the features of the same original and inverted images under different augmentations to model semantic-level linguistic context and local character discrimination.
arXiv Detail & Related papers (2024-05-09T15:23:38Z)
- Exploring Simple Open-Vocabulary Semantic Segmentation [7.245983878396646]
Open-vocabulary semantic segmentation models aim to accurately assign a semantic label to each pixel in an image from a set of arbitrary open-vocabulary texts.
In this paper, we introduce S-Seg, a novel model that can achieve surprisingly strong performance without depending on any of the above elements.
arXiv Detail & Related papers (2024-01-22T18:59:29Z)
- User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods emphasize the user context fusion process by memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
STAIR significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
arXiv Detail & Related papers (2023-01-30T17:21:30Z)
- Unified Contrastive Learning in Image-Text-Label Space [130.31947133453406]
Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
UniCL stand-alone is a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets.
arXiv Detail & Related papers (2022-04-07T17:34:51Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS leverages vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a generic sketch of this objective follows the list).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
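As a generic illustration of the dual-encoder contrastive alignment described in the last entry above, the sketch below implements a symmetric in-batch InfoNCE-style loss on paired image and text embeddings. The encoders, batch pairing, and temperature value are assumptions for illustration, not the authors' implementation.

```python
# Generic symmetric contrastive loss for a dual-encoder setup (illustrative sketch).
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric in-batch contrastive loss; row i of each tensor is a matching pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature               # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) +              # image-to-text direction
                  F.cross_entropy(logits.t(), targets))           # text-to-image direction

# Usage with stand-in encoders that output (batch, dim) embeddings:
# loss = contrastive_loss(image_encoder(images), text_encoder(token_ids))
```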