BatchFormerV2: Exploring Sample Relationships for Dense Representation
Learning
- URL: http://arxiv.org/abs/2204.01254v1
- Date: Mon, 4 Apr 2022 05:53:42 GMT
- Title: BatchFormerV2: Exploring Sample Relationships for Dense Representation
Learning
- Authors: Zhi Hou, Baosheng Yu, Chaoyue Wang, Yibing Zhan, Dacheng Tao
- Abstract summary: BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
- Score: 88.82371069668147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention mechanisms have been very popular in deep neural networks, where
the Transformer architecture has achieved great success in not only natural
language processing but also visual recognition applications. Recently, a new
Transformer module, applied to the batch dimension rather than the spatial or
channel dimension, i.e., BatchFormer [18], has been introduced to explore
sample relationships for overcoming data scarcity challenges. However, it only works
with image-level representations for classification. In this paper, we devise a
more general batch Transformer module, BatchFormerV2, which further enables
exploring sample relationships for dense representation learning. Specifically,
the proposed module is trained with a two-stream pipeline, i.e., one stream
with and one stream without the BatchFormerV2 module, and the BatchFormerV2
stream is removed at test time. Therefore, the proposed method
is a plug-and-play module and can be easily integrated into different vision
Transformers without any extra inference cost. Without bells and whistles, we
show the effectiveness of the proposed method for a variety of popular visual
recognition tasks, including image classification and two important dense
prediction tasks: object detection and panoptic segmentation. Particularly,
BatchFormerV2 consistently improves current DETR-based detection methods (e.g.,
DETR, Deformable-DETR, Conditional DETR, and SMCA) by over 1.3%. Code will be
made publicly available.
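
The abstract describes the module only at a high level. As a rough illustration of the stated design (a Transformer applied along the batch dimension of dense features, trained with a two-stream pipeline whose BatchFormerV2 stream is dropped at test time), here is a minimal PyTorch sketch; the class and function names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class BatchTransformer(nn.Module):
    """Transformer encoder layer applied along the batch dimension.

    For dense features of shape (B, C, H, W), each spatial location is
    treated as an independent sequence whose tokens are the B samples of
    the mini-batch, so self-attention mixes information across samples.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # num_heads must divide dim.
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # (B, C, H, W) -> (B, HW, C). With batch_first=False (the default),
        # dim 0 is the sequence, so attention runs over the B samples,
        # independently for each of the HW spatial locations.
        tokens = x.flatten(2).permute(0, 2, 1)
        tokens = self.layer(tokens)
        return tokens.permute(0, 2, 1).reshape(b, c, h, w)


def two_stream_forward(backbone, batch_former, head, images):
    """Training-time two-stream pass: the plain stream and the BatchFormerV2
    stream share the backbone and the prediction head, so the head remains
    usable without the batch module at inference time."""
    feats = backbone(images)                 # (B, C, H, W)
    plain_out = head(feats)                  # stream without BatchFormerV2
    batch_out = head(batch_former(feats))    # stream with BatchFormerV2
    return plain_out, batch_out              # both outputs are supervised
```

At inference only `backbone` and `head` are used, so the extra branch adds no test-time cost, mirroring the plug-and-play claim in the abstract.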
Related papers
- With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning [47.96387857237473]
We devise a network which can perform attention over activations obtained while processing other training samples.
Our memory models the distribution of past keys and values through the definition of prototype vectors.
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training.
arXiv Detail & Related papers (2023-08-23T18:53:00Z)
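The entry above describes attention over prototype vectors that summarise keys and values from other training samples. The following is a generic PyTorch sketch of that idea only, assuming learnable prototype vectors and a single cross-attention read-out; it is not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeMemoryAttention(nn.Module):
    """Cross-attention over a small bank of prototype key/value vectors.

    The prototypes stand in for activations seen on other training samples;
    here they are simply learnable parameters (an assumption made for
    illustration), and each query attends over the whole bank.
    """

    def __init__(self, dim: int, num_prototypes: int = 64):
        super().__init__()
        self.proto_k = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.proto_v = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.q_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, dim) activations; attend over the prototype memory.
        q = self.q_proj(x)
        attn = F.softmax(q @ self.proto_k.t() / q.size(-1) ** 0.5, dim=-1)
        return x + attn @ self.proto_v  # residual add of the memory read-out
```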
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Part-guided Relational Transformers for Fine-grained Visual Recognition [59.20531172172135]
We propose a framework to learn the discriminative part features and explore correlations with a feature transformation module.
Our proposed approach does not rely on additional part branches and reaches state-of-the-art performance on fine-grained object recognition benchmarks.
arXiv Detail & Related papers (2022-12-28T03:45:56Z)
- Rethinking Batch Sample Relationships for Data Representation: A Batch-Graph Transformer based Approach [16.757917001089762]
We design a simple yet flexible Batch-Graph Transformer (BGFormer) for mini-batch sample representations.
It deeply captures the relationships of image samples from both visual and semantic perspectives.
Extensive experiments on four popular datasets demonstrate the effectiveness of the proposed model.
arXiv Detail & Related papers (2022-11-19T08:46:50Z)
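The BGFormer entry above describes modelling mini-batch sample relationships from both visual and semantic perspectives. Below is a hedged PyTorch sketch of one way such a batch graph could bias attention across samples (visual affinity from feature similarity, semantic edges from shared labels); the names and design are assumptions, not the paper's implementation.

```python
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class BatchGraphAttention(nn.Module):
    """Attention over the samples of a mini-batch, biased by a batch graph
    built from visual similarity and (optionally) shared class labels."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor,
                labels: Optional[torch.Tensor] = None) -> torch.Tensor:
        # feats: (B, dim), one embedding per sample; labels: (B,), optional.
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        scores = q @ k.t() / feats.size(-1) ** 0.5   # visual affinity (B, B)
        if labels is not None:
            # Semantic edge: positive bias between same-class samples.
            same = (labels[:, None] == labels[None, :]).float()
            scores = scores + same
        attn = F.softmax(scores, dim=-1)
        return feats + self.out(attn @ v)            # residual update
```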
- SIM-Trans: Structure Information Modeling Transformer for Fine-grained Visual Categorization [59.732036564862796]
We propose the Structure Information Modeling Transformer (SIM-Trans) to incorporate object structure information into transformer for enhancing discriminative representation learning.
The two proposed modules are lightweight, can be plugged into any transformer network, and are easily trained end-to-end.
Experiments and analyses demonstrate that the proposed SIM-Trans achieves state-of-the-art performance on fine-grained visual categorization benchmarks.
arXiv Detail & Related papers (2022-08-31T03:00:07Z)
- Few-Shot Learning Meets Transformer: Unified Query-Support Transformers for Few-Shot Classification [16.757917001089762]
Few-shot classification aims to recognize unseen classes using very limited samples.
In this paper, we show that the two challenges can be well modeled simultaneously via a unified Query-Support TransFormer model.
Experiments on four popular datasets demonstrate the effectiveness and superiority of the proposed QSFormer.
arXiv Detail & Related papers (2022-08-26T01:53:23Z)
- Visual Transformer for Task-aware Active Learning [49.903358393660724]
We present a novel pipeline for pool-based Active Learning.
Our method exploits accessible unlabelled examples during training to estimate their correlation with the labelled examples.
Visual Transformer models non-local visual concept dependency between labelled and unlabelled examples.
arXiv Detail & Related papers (2021-06-07T17:13:59Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- TransReID: Transformer-based Object Re-Identification [20.02035310635418]
We explore the Vision Transformer (ViT), a pure transformer-based model, for the object re-identification (ReID) task.
With several adaptations, a strong baseline ViT-BoT is constructed with ViT as the backbone.
We propose a pure-transformer framework dubbed TransReID, which is the first work to use a pure Transformer for ReID research.
arXiv Detail & Related papers (2021-02-08T17:33:59Z)