Granularity-aware Adaptation for Image Retrieval over Multiple Tasks
- URL: http://arxiv.org/abs/2210.02254v1
- Date: Wed, 5 Oct 2022 13:31:52 GMT
- Title: Granularity-aware Adaptation for Image Retrieval over Multiple Tasks
- Authors: Jon Almazán, Byungsoo Ko, Geonmo Gu, Diane Larlus, Yannis Kalantidis
- Abstract summary: Grappa is an approach that starts from a strong pretrained model, and adapts it to tackle multiple retrieval tasks concurrently.
We reconcile all adaptor sets into a single unified model suited for all retrieval tasks by learning fusion layers.
Results on a benchmark composed of six heterogeneous retrieval tasks show that the unsupervised Grappa model improves the zero-shot performance of a state-of-the-art self-supervised learning model.
- Score: 30.505620321478688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Strong image search models can be learned for a specific domain, i.e. a
set of labels, provided that some labeled images of that domain are available. A
practical visual search model, however, should be versatile enough to solve
multiple retrieval tasks simultaneously, even if those cover very different
specialized domains. Additionally, it should be able to benefit from even
unlabeled images from these various retrieval tasks. This is the more practical
scenario that we consider in this paper. We address it with the proposed
Grappa, an approach that starts from a strong pretrained model, and adapts it
to tackle multiple retrieval tasks concurrently, using only unlabeled images
from the different task domains. We extend the pretrained model with multiple
independently trained sets of adaptors that use pseudo-label sets of different
sizes, effectively mimicking different pseudo-granularities. We reconcile all
adaptor sets into a single unified model suited for all retrieval tasks by
learning fusion layers that we guide by propagating pseudo-granularity
attentions across neighbors in the feature space. Results on a benchmark
composed of six heterogeneous retrieval tasks show that the unsupervised Grappa
model improves the zero-shot performance of a state-of-the-art self-supervised
learning model, and in some cases reaches or improves over a task label-aware
oracle that selects the most fitting pseudo-granularity per task.
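The fusion mechanism described in the abstract (several adaptor sets, each tied to a pseudo-granularity, reconciled by attention weights that are propagated across neighbors in feature space) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation; the function names `fuse` and `propagate_attn`, the plain-list feature representation, and the one-step neighbor averaging are all assumptions.

```python
def fuse(adaptor_feats, attn):
    """Weighted sum of K adaptor feature vectors (one per pseudo-granularity).

    adaptor_feats: list of K feature vectors for one image.
    attn: K attention weights summing to 1.
    """
    dim = len(adaptor_feats[0])
    return [sum(a * f[d] for a, f in zip(attn, adaptor_feats))
            for d in range(dim)]

def propagate_attn(attns, neighbors):
    """One step of attention propagation: average each image's
    attention weights with those of its feature-space neighbors.

    attns: per-image attention vectors (each sums to 1).
    neighbors: per-image lists of neighbor indices.
    Averaging vectors that each sum to 1 preserves that normalization.
    """
    out = []
    for i, a in enumerate(attns):
        group = [a] + [attns[j] for j in neighbors[i]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out
```

For example, two images that are mutual neighbors but start with opposite one-hot attentions would, after one propagation step, both carry the averaged weights, which are then used to fuse their adaptor features.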
Related papers
- One Diffusion to Generate Them All [54.82732533013014]
OneDiffusion is a versatile, large-scale diffusion model that supports bidirectional image synthesis and understanding.
It enables conditional generation from inputs such as text, depth, pose, layout, and semantic maps.
OneDiffusion allows for multi-view generation, camera pose estimation, and instant personalization using sequential image inputs.
arXiv Detail & Related papers (2024-11-25T12:11:05Z) - MOWA: Multiple-in-One Image Warping Model [65.73060159073644]
We propose a Multiple-in-One image warping model (named MOWA) in this work.
We mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level.
To our knowledge, this is the first work that solves multiple practical warping tasks in one single model.
arXiv Detail & Related papers (2024-04-16T16:50:35Z) - Instruct-Imagen: Image Generation with Multi-modal Instruction [90.04481955523514]
instruct-imagen is a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks.
We introduce *multi-modal instruction* for image generation, a task representation articulating a range of generation intents with precision.
Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain.
arXiv Detail & Related papers (2024-01-03T19:31:58Z) - Self-Supervised Open-Ended Classification with Small Visual Language Models [60.23212389067007]
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
arXiv Detail & Related papers (2023-09-30T21:41:21Z) - Mixture of Self-Supervised Learning [2.191505742658975]
Self-supervised learning works by using a pretext task which will be trained on the model before being applied to a specific task.
Previous studies have only used one type of transformation as a pretext task.
This raises the question of how performance is affected when more than one pretext task is used, with a gating network combining all pretext tasks.
arXiv Detail & Related papers (2023-07-27T14:38:32Z) - Multi-Domain Learning with Modulation Adapters [33.54630534228469]
Multi-domain learning aims to handle related tasks, such as image classification across multiple domains, simultaneously.
Modulation Adapters update the convolutional weights of the model in a multiplicative manner for each task.
Our approach yields excellent results, with accuracies that are comparable to or better than those of existing state-of-the-art approaches.
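The multiplicative update described above can be illustrated with a small sketch: task-specific scale factors multiply the shared convolutional weights instead of replacing them, so the shared parameters stay untouched. The function name `modulate` and the per-output-channel scaling are assumptions for illustration, not the paper's exact parameterization.

```python
def modulate(shared_kernels, channel_scales):
    """Apply a task-specific multiplicative adapter.

    shared_kernels: flattened conv kernels, one list per output channel.
    channel_scales: one learned scale per output channel for this task.
    Returns the modulated kernels; the shared kernels are not mutated.
    """
    return [[w * s for w in kernel]
            for kernel, s in zip(shared_kernels, channel_scales)]
```

Because only the small set of scales is learned per task, each new task adds few parameters while reusing the shared backbone.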
arXiv Detail & Related papers (2023-07-17T14:40:16Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - A Generalist Framework for Panoptic Segmentation of Images and Videos [61.61453194912186]
We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive bias of the task.
A diffusion model is proposed to model panoptic masks, with a simple architecture and generic loss function.
Our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically.
arXiv Detail & Related papers (2022-10-12T16:18:25Z) - Semantic Diversity Learning for Zero-Shot Multi-label Classification [14.480713752871523]
This study introduces an end-to-end model training for multi-label zero-shot learning.
We propose to use an embedding matrix having principal embedding vectors trained using a tailored loss function.
In addition, during training, we suggest up-weighting in the loss function image samples presenting higher semantic diversity to encourage the diversity of the embedding matrix.
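The up-weighting idea above can be sketched as a per-sample loss reweighted by a diversity score. This is a hypothetical illustration; the function name `weighted_loss`, the normalization by the maximum diversity, and the `alpha` strength parameter are assumptions, not the paper's tailored loss.

```python
def weighted_loss(losses, diversity, alpha=1.0):
    """Sum of per-sample losses, each scaled up by its semantic diversity.

    losses: per-sample loss values.
    diversity: per-sample semantic-diversity scores (>= 0).
    alpha: strength of the up-weighting.
    """
    max_d = max(diversity)
    if max_d == 0.0:
        return sum(losses)  # no diversity signal: plain sum
    return sum(l * (1.0 + alpha * d / max_d)
               for l, d in zip(losses, diversity))
```

With `alpha=1.0`, the most diverse sample contributes double its base loss, which pushes the embedding matrix toward covering more of the semantic space.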
arXiv Detail & Related papers (2021-05-12T19:39:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.