SiRi: A Simple Selective Retraining Mechanism for Transformer-based
Visual Grounding
- URL: http://arxiv.org/abs/2207.13325v1
- Date: Wed, 27 Jul 2022 07:01:01 GMT
- Title: SiRi: A Simple Selective Retraining Mechanism for Transformer-based
Visual Grounding
- Authors: Mengxue Qu, Yu Wu, Wu Liu, Qiqi Gong, Xiaodan Liang, Olga Russakovsky,
Yao Zhao, and Yunchao Wei
- Abstract summary: Selective Retraining (SiRi) can significantly outperform previous approaches on three popular benchmarks.
SiRi performs surprisingly well even with limited training data.
We also extend it to other transformer-based visual grounding models and further vision-language tasks to verify its generality.
- Score: 131.0977050185209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate how to achieve better visual grounding with
modern vision-language transformers, and propose a simple yet powerful
Selective Retraining (SiRi) mechanism for this challenging task. In particular,
SiRi conveys a significant principle to visual grounding research: a
better-initialized vision-language encoder helps the model converge to a better
local minimum, improving performance accordingly. Concretely, we continually
update the parameters of the encoder as training goes on, while periodically
re-initializing the rest of the parameters to compel the model to be better
optimized on top of the enhanced encoder. SiRi significantly outperforms
previous approaches on three popular benchmarks. Specifically, our
method achieves 83.04% Top1 accuracy on RefCOCO+ testA, outperforming the
state-of-the-art approaches (training from scratch) by more than 10.21%.
Additionally, we reveal that SiRi performs surprisingly well even with limited
training data. We also extend it to other transformer-based visual grounding
models and further vision-language tasks to verify its generality.
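
To make the mechanism concrete, a minimal PyTorch-style sketch of the training loop follows. This illustrates the idea as the abstract describes it, not the authors' implementation: the split into `encoder` and `head` modules, the number of rounds, and the optimizer settings are all assumptions.

```python
import torch
import torch.nn as nn

def reinit_module(module: nn.Module) -> None:
    """Re-initialize every submodule that defines reset_parameters()."""
    for m in module.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()

def siri_train(encoder, head, loader, loss_fn,
               rounds=3, epochs_per_round=10, lr=1e-4):
    """Selective retraining sketch: the vision-language encoder keeps its
    continually updated weights across rounds, while everything else is
    periodically re-initialized and optimized afresh on top of it."""
    for r in range(rounds):
        if r > 0:
            reinit_module(head)  # reset all parameters except the encoder's
        # fresh optimizer so the re-initialized parameters start cleanly
        opt = torch.optim.AdamW(
            list(encoder.parameters()) + list(head.parameters()), lr=lr)
        for _ in range(epochs_per_round):
            for images, texts, targets in loader:
                loss = loss_fn(head(encoder(images, texts)), targets)
                opt.zero_grad()
                loss.backward()
                opt.step()
    return encoder, head
```

The key design choice, per the abstract, is that the encoder's parameters are never reset, so each retraining round starts from a better-initialized encoder.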
Related papers
- Addressing Sample Inefficiency in Multi-View Representation Learning [6.621303125642322]
Non-contrastive self-supervised learning (NC-SSL) methods have shown great promise for label-free representation learning in computer vision.
We provide theoretical insights on the implicit bias of the Barlow Twins and VICReg losses that can explain these findings and guide the development of more principled recommendations.
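
For context, the Barlow Twins loss whose implicit bias is analyzed drives the cross-correlation matrix of two views' embeddings toward the identity. A standard sketch follows; the lambda value and shapes are conventional defaults, not values from this paper.

```python
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor,
                      lam: float = 5e-3) -> torch.Tensor:
    """Barlow Twins: push the cross-correlation matrix of two views'
    standardized embeddings toward the identity (z1, z2: batch x dim)."""
    n = z1.shape[0]
    z1 = (z1 - z1.mean(0)) / z1.std(0)   # standardize each dimension
    z2 = (z2 - z2.mean(0)) / z2.std(0)
    c = (z1.T @ z2) / n                  # cross-correlation matrix, dim x dim
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                # invariance
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # redundancy
    return on_diag + lam * off_diag
```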
arXiv Detail & Related papers (2023-12-17T14:14:31Z)

- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach that exploits pre-trained vision-language models (e.g., CLIP) and enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)

- Efficient Training for Visual Tracking with Deformable Transformer [0.0]
We present DETRack, a streamlined end-to-end visual object tracking framework.
Our framework utilizes an efficient encoder-decoder structure in which a deformable transformer decoder acts as the target head.
For training, we introduce a novel one-to-many label assignment and an auxiliary denoising technique.
arXiv Detail & Related papers (2023-09-06T03:07:43Z)

- Parameter-Efficient Transfer Learning for Remote Sensing Image-Text Retrieval [10.84733740863356]
In this work, we investigate the parameter-efficient transfer learning (PETL) method to transfer visual-language knowledge from the natural domain to the RS domain on the image-text retrieval task.
Our proposed model only contains 0.16M training parameters, which can achieve a parameter reduction of 98.9% compared to full fine-tuning.
Our retrieval performance exceeds that of traditional methods by 7-13% and is comparable to or better than full fine-tuning.
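
This summary does not specify the PETL module design, but a conventional adapter-style sketch shows how such a small trainable budget arises: freeze the backbone and train only bottleneck adapters. All names and sizes below are illustrative assumptions, not the paper's architecture.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: the only trainable part of a frozen backbone."""
    def __init__(self, dim: int = 512, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual update

def freeze_backbone(backbone: nn.Module) -> None:
    for p in backbone.parameters():
        p.requires_grad = False

# A handful of adapters stays in the ~0.1M-parameter range:
adapters = nn.ModuleList([Adapter(512, 32) for _ in range(4)])
n_trainable = sum(p.numel() for p in adapters.parameters())
print(f"trainable parameters: {n_trainable}")  # 4 * 33312 = 133248, about 0.13M
```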
arXiv Detail & Related papers (2023-08-24T02:43:53Z)

- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
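
The exact architecture is not given in this summary, but a heavily hedged sketch of the stated idea (a lightweight one-step transformer that reads a batch of point-to-model residuals and updates a per-point state into sampling weights) might look like this; all dimensions and the sigmoid weighting are assumptions.

```python
import torch
import torch.nn as nn

class ConsensusAttention(nn.Module):
    """Sketch: one attention step over per-point residuals that maintains
    a per-point estimation state and emits per-point sampling weights."""
    def __init__(self, dim: int = 32, heads: int = 4):
        super().__init__()
        self.embed = nn.Linear(1, dim)            # embed scalar residuals
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_weight = nn.Linear(dim, 1)

    def forward(self, residuals, state):
        # residuals: (B, N) residuals of N points w.r.t. the current model
        # state:     (B, N, dim) per-point estimation state
        x = self.embed(residuals.unsqueeze(-1)) + state
        mixed, _ = self.attn(x, x, x)             # points share consensus info
        state = state + mixed                     # update the per-point state
        weights = torch.sigmoid(self.to_weight(state)).squeeze(-1)
        return weights, state                     # weights guide next sampling
```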
arXiv Detail & Related papers (2023-07-26T08:25:46Z)

- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
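
A minimal sketch of that inference-time refinement, assuming unit-normalized CLIP embeddings, simple top-k retrieval, and a single `nn.TransformerEncoderLayer` as the fusion module (the paper's exact fusion design is not detailed in this summary):

```python
import torch
import torch.nn as nn

class RetrievalFusion(nn.Module):
    """Refine a frozen CLIP embedding with its k nearest memory items
    via a single-layer fusion transformer (a sketch, not RECO itself)."""
    def __init__(self, dim: int = 512, k: int = 8):
        super().__init__()
        self.k = k
        self.fuse = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)

    def forward(self, query_emb, memory):
        # query_emb: (B, dim) frozen CLIP embeddings; memory: (M, dim)
        sims = query_emb @ memory.T              # cosine sims if normalized
        idx = sims.topk(self.k, dim=-1).indices  # (B, k) nearest memory items
        retrieved = memory[idx]                  # (B, k, dim)
        tokens = torch.cat([query_emb.unsqueeze(1), retrieved], dim=1)
        fused = self.fuse(tokens)                # mix query with retrievals
        return nn.functional.normalize(fused[:, 0], dim=-1)  # refined embedding
```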
arXiv Detail & Related papers (2023-06-12T15:52:02Z)

- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)

- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers a high sparsity rate and high head diversity, and the induced cross-attention achieves better accuracy with respect to source-target word alignment.
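
The core substitution is a one-line change to scaled dot-product attention: negative scores are clipped to exactly zero instead of being softmax-normalized, so many attention weights vanish. A minimal sketch (ReLA's additional normalization details are omitted here):

```python
import math
import torch

def rectified_linear_attention(q, k, v):
    """Scaled dot-product attention with ReLU in place of softmax:
    negative scores become exactly zero, yielding sparse weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.relu(scores)   # sparse, unnormalized attention weights
    return weights @ v
```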
arXiv Detail & Related papers (2021-04-14T17:52:38Z)

- Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model with a time-reduction layer incorporated inside the Transformer encoder layers.
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
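
A time-reduction layer is typically implemented by stacking r adjacent frames and projecting back to the model width, so later encoder layers attend over a shorter sequence. A minimal sketch with an assumed reduction factor follows (the paper's exact factor and placement are not given in this summary).

```python
import torch.nn as nn

class TimeReduction(nn.Module):
    """Shrink the time axis by a factor r: concatenate r adjacent frames
    and project the stacked features back to the model dimension."""
    def __init__(self, dim: int = 256, r: int = 2):
        super().__init__()
        self.r = r
        self.proj = nn.Linear(r * dim, dim)

    def forward(self, x):
        # x: (B, T, dim) acoustic frames from a previous encoder layer
        b, t, d = x.shape
        t = (t // self.r) * self.r           # drop frames that don't fit
        x = x[:, :t].reshape(b, t // self.r, self.r * d)
        return self.proj(x)                  # (B, T // r, dim)
```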
arXiv Detail & Related papers (2021-03-17T21:02:36Z)