Few-Shot Learning with Visual Distribution Calibration and Cross-Modal
Distribution Alignment
- URL: http://arxiv.org/abs/2305.11439v1
- Date: Fri, 19 May 2023 05:45:17 GMT
- Title: Few-Shot Learning with Visual Distribution Calibration and Cross-Modal
Distribution Alignment
- Authors: Runqi Wang, Hao Zheng, Xiaoyue Duan, Jianzhuang Liu, Yuning Lu, Tian
Wang, Songcen Xu, Baochang Zhang
- Abstract summary: Pre-trained vision-language models have inspired much research on few-shot learning.
With only a few training images, the visual feature distributions are easily distracted by class-irrelevant information in images.
We propose a Selective Attack module that generates spatial attention maps of images to guide the attacks on class-irrelevant image areas.
- Score: 47.53887941065894
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained vision-language models have inspired much research on few-shot
learning. However, with only a few training images, there exist two crucial
problems: (1) the visual feature distributions are easily distracted by
class-irrelevant information in images, and (2) the alignment between the
visual and language feature distributions is difficult. To deal with the
distraction problem, we propose a Selective Attack module, which consists of
trainable adapters that generate spatial attention maps of images to guide the
attacks on class-irrelevant image areas. By perturbing these areas, the
critical features are captured and the visual distributions of image features
are calibrated. To better align the visual and language feature distributions
that describe the same object class, we propose a cross-modal distribution
alignment module, in which we introduce a vision-language prototype for each
class to align the distributions, and adopt the Earth Mover's Distance (EMD) to
optimize the prototypes. For efficient computation, the upper bound of EMD is
derived. In addition, we propose an augmentation strategy to increase the
diversity of the images and the text prompts, which can reduce overfitting to
the few-shot training images. Extensive experiments on 11 datasets demonstrate
that our method consistently outperforms prior art in few-shot learning. The
implementation code will be available at https://github.com/bhrqw/SADA.
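To make the Selective Attack idea concrete, below is a minimal PyTorch-style sketch: a small trainable adapter predicts a spatial attention map, and only low-attention (class-irrelevant) regions are perturbed. The adapter architecture and the Gaussian-noise stand-in for the "attack" are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn as nn

class SelectiveAttack(nn.Module):
    """Sketch of the Selective Attack idea: a trainable adapter produces a
    spatial attention map, and the 'attack' perturbs only low-attention
    (class-irrelevant) image areas. Layer sizes and the Gaussian-noise
    perturbation are illustrative assumptions."""

    def __init__(self, in_channels: int = 3, noise_std: float = 0.1):
        super().__init__()
        # Lightweight trainable adapter: per-pixel attention in [0, 1].
        self.adapter = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        self.noise_std = noise_std

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        attn = self.adapter(images)          # (B, 1, H, W); high = class-relevant
        noise = torch.randn_like(images) * self.noise_std
        # Perturb only the low-attention, class-irrelevant regions.
        return images + (1.0 - attn) * noise
```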
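The cross-modal alignment step scores how well a class's visual features match its vision-language prototypes via the Earth Mover's Distance. The paper derives an upper bound of EMD for efficient computation; the sketch below instead uses the standard entropic-regularized Sinkhorn approximation as a stand-in, with the feature shapes and cosine cost chosen purely for illustration.
```python
import torch
import torch.nn.functional as F

def sinkhorn_emd(cost: torch.Tensor, n_iters: int = 50, eps: float = 0.05) -> torch.Tensor:
    """Entropic-regularized approximation of the Earth Mover's Distance
    between two uniform discrete distributions, given a pairwise cost matrix.
    Sinkhorn iteration is a common stand-in, not the paper's derived bound."""
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)       # uniform weights over visual features
    nu = torch.full((m,), 1.0 / m)       # uniform weights over prototypes
    K = torch.exp(-cost / eps)           # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    transport = torch.diag(u) @ K @ torch.diag(v)  # approximate optimal plan
    return (transport * cost).sum()

# Illustrative usage: all shapes and the cosine cost are assumptions.
visual_feats = torch.randn(8, 512)   # e.g. 8 image/patch features for a class
prototypes = torch.randn(4, 512)     # e.g. 4 vision-language prototype vectors
cost = 1.0 - F.cosine_similarity(
    visual_feats.unsqueeze(1), prototypes.unsqueeze(0), dim=-1)
alignment_loss = sinkhorn_emd(cost)
```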
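Finally, a hedged sketch of the diversity-increasing augmentation idea: random image transforms paired with varied text-prompt templates. The specific transforms and templates here are assumptions; the paper's actual augmentation recipe may differ.
```python
import random
from torchvision import transforms

# Image-side augmentation: standard random transforms (assumed, not the
# paper's exact recipe) to diversify the few-shot training images.
image_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

# Text-side augmentation: sample among several prompt templates per class.
# These templates are illustrative placeholders.
PROMPT_TEMPLATES = [
    "a photo of a {}.",
    "a cropped photo of a {}.",
    "a low-resolution photo of a {}.",
]

def augment_prompt(class_name: str) -> str:
    """Return a randomly chosen prompt for the given class name."""
    return random.choice(PROMPT_TEMPLATES).format(class_name)
```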
Related papers
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the Image Address Localization (IAL) problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Learning 1D Causal Visual Representation with De-focus Attention Networks [108.72931590504406]
This paper explores the feasibility of representing images using 1D causal modeling.
We propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns.
arXiv Detail & Related papers (2024-06-06T17:59:56Z)
- Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
- Black Box Few-Shot Adaptation for Vision-Language models [41.49584259596654]
Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners.
We describe a black-box method for V-L few-shot adaptation that operates on pre-computed image and text features.
We propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain.
arXiv Detail & Related papers (2023-04-04T12:42:29Z)
- Sketch-Guided Text-to-Image Diffusion Models [57.12095262189362]
We introduce a universal approach to guide a pretrained text-to-image diffusion model.
Our method does not require training a dedicated model or a specialized encoder for the task.
We take a particular focus on the sketch-to-image translation task, revealing a robust and expressive way to generate images.
arXiv Detail & Related papers (2022-11-24T18:45:32Z)
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent images in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
- Improving Few-shot Learning by Spatially-aware Matching and CrossTransformer [116.46533207849619]
We study the impact of scale and location mismatch in the few-shot learning scenario.
We propose a novel Spatially-aware Matching scheme to effectively perform matching across multiple scales and locations.
arXiv Detail & Related papers (2020-01-06T14:10:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.