A Simple Long-Tailed Recognition Baseline via Vision-Language Model
- URL: http://arxiv.org/abs/2111.14745v1
- Date: Mon, 29 Nov 2021 17:49:24 GMT
- Title: A Simple Long-Tailed Recognition Baseline via Vision-Language Model
- Authors: Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng
Li, Peng Gao, Yu Qiao
- Abstract summary: The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
- Score: 92.2866546058082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The visual world naturally exhibits a long-tailed distribution of open
classes, which poses great challenges to modern visual systems. Existing
approaches either perform class re-balancing strategies or directly improve
network modules to address the problem. However, they still train models with a
finite set of predefined labels, limiting their supervision information and
restricting their transferability to novel instances. Recent advances in
large-scale contrastive visual-language pretraining shed light on a new pathway
for visual recognition. With open-vocabulary supervision, pretrained
contrastive vision-language models learn powerful multimodal representations
that show promise in handling data deficiency and unseen concepts. By
calculating the semantic similarity between visual and text inputs, visual
recognition is converted to a vision-language matching problem. Inspired by
this, we propose BALLAD to leverage contrastive vision-language models for
long-tailed recognition. We first continue pretraining the vision-language
backbone through contrastive learning on a specific long-tailed target dataset.
Afterward, we freeze the backbone and further employ an additional adapter
layer to enhance the representations of tail classes on balanced training
samples built with re-sampling strategies. Extensive experiments have been
conducted on three popular long-tailed recognition benchmarks. As a result, our
simple and effective approach sets new state-of-the-art performance and
outperforms competitive baselines by a large margin. Code is released at
https://github.com/gaopengcuhk/BALLAD.
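For concreteness, the following is a minimal PyTorch sketch of the second (adapter) phase described in the abstract. It is a sketch under stated assumptions, not the released implementation: the backbone stands for any CLIP-style model exposing encode_image/encode_text (placeholder names) that was already contrastively tuned on the long-tailed dataset in the first phase, the adapter is a single residual linear layer, and the balanced stream is built with inverse class-frequency re-sampling; layer sizes and hyperparameters are illustrative.

```python
# Sketch of the adapter phase under the assumptions stated above: the
# vision-language backbone is frozen, and only a small adapter on top of the
# image features is trained on class-balanced batches.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, WeightedRandomSampler


class ResidualAdapter(nn.Module):
    """A light adapter refining frozen image features (illustrative design)."""

    def __init__(self, dim: int, residual_ratio: float = 0.2):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.ratio = residual_ratio

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Blend adapted features with the frozen backbone features.
        return self.ratio * self.fc(feat) + (1.0 - self.ratio) * feat


def balanced_loader(dataset, labels, batch_size=256):
    """Re-sampling: inverse class-frequency weights equalize head/tail exposure."""
    labels = torch.as_tensor(labels)
    class_counts = torch.bincount(labels).float()
    sample_weights = (1.0 / class_counts)[labels]
    sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


def train_adapter(backbone, adapter, loader, class_prompts,
                  epochs=10, lr=1e-4, tau=0.07):
    """Phase B: backbone frozen, only the adapter is updated."""
    for p in backbone.parameters():
        p.requires_grad_(False)
    with torch.no_grad():
        # Class-name prompts (e.g. "a photo of a {class}") act as the classifier.
        text_feats = F.normalize(backbone.encode_text(class_prompts), dim=-1)
    opt = torch.optim.AdamW(adapter.parameters(), lr=lr)
    for _ in range(epochs):
        for images, targets in loader:
            with torch.no_grad():
                img_feats = backbone.encode_image(images)
            img_feats = F.normalize(adapter(img_feats), dim=-1)
            # Recognition as vision-language matching: cosine-similarity logits.
            logits = img_feats @ text_feats.t() / tau
            loss = F.cross_entropy(logits, targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
```

At inference, the same cosine similarities between adapted image features and class-prompt embeddings serve as logits, so recognition remains a vision-language matching problem.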
Related papers
- Croc: Pretraining Large Multimodal Models with Cross-Modal Comprehension [21.500920290909843]
We propose a new pretraining paradigm for Large Language Models (LLMs) to enhance their visual comprehension capabilities.
Specifically, we design a dynamically learnable prompt token pool and employ the Hungarian algorithm to replace part of the original visual tokens with the most relevant prompt tokens.
We present a new foundation model called Croc, which achieves new state-of-the-art performance on massive vision-language benchmarks.
arXiv Detail & Related papers (2024-10-18T09:44:25Z)
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$\text{EVL}_\text{Gen}$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features complement the cross-modal features well and improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Contrastive Learning with Boosted Memorization [36.957895270908324]
Self-supervised learning has achieved great success in the representation learning of visual and textual data.
Recent attempts at self-supervised long-tailed learning rebalance training from either the loss perspective or the model perspective.
We propose a novel Boosted Contrastive Learning (BCL) method to enhance the long-tailed learning in the label-unaware context.
arXiv Detail & Related papers (2022-05-25T11:54:22Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
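The cross-modality contrastive learning mentioned in the last entry, like the contrastive pretraining continued in the first phase of BALLAD, reduces in its simplest (global, non-region) form to a symmetric InfoNCE objective over matched image-text pairs. The sketch below is illustrative: the encoders, projection heads, and temperature value are assumptions, and the dense/region variant applies the same idea to region-level features.

```python
# Minimal sketch of the symmetric image-text contrastive objective that
# CLIP-style pretraining builds on; names and the temperature are placeholders.
import torch
import torch.nn.functional as F


def image_text_contrastive_loss(image_feats: torch.Tensor,
                                text_feats: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """image_feats, text_feats: [batch, dim] embeddings of matched pairs."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature                 # pairwise cosine similarities
    targets = torch.arange(len(img), device=img.device)  # i-th image matches i-th text
    # Symmetric cross-entropy over the image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Minimizing this loss pulls each image embedding toward its paired text embedding and pushes it away from the other texts in the batch, which is what makes the similarity-based matching described above meaningful.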
This list is automatically generated from the titles and abstracts of the papers on this site.