Text as Image: Learning Transferable Adapter for Multi-Label
Classification
- URL: http://arxiv.org/abs/2312.04160v1
- Date: Thu, 7 Dec 2023 09:22:20 GMT
- Title: Text as Image: Learning Transferable Adapter for Multi-Label
Classification
- Authors: Xuelin Zhu, Jiuxin Cao, Jian Liu, Dongqi Tang, Furong Xu, Weijia Liu,
Jiawei Ge, Bo Liu, Qingpei Guo, Tianyi Zhang
- Abstract summary: We introduce an effective approach to employ large language models for multi-label instruction-following text generation.
In this way, a fully automated pipeline for visual label recognition is developed without relying on any manual data.
- Score: 13.11583340598517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained vision-language models have notably accelerated progress
in open-world concept recognition. Their impressive zero-shot ability has
recently been transferred to multi-label image classification via prompt
tuning, enabling novel labels to be discovered in an open-vocabulary manner.
However, this paradigm incurs non-trivial training costs and becomes
computationally prohibitive for a large number of candidate labels. To address
this issue, we note that vision-language pre-training aligns images and texts
in a unified embedding space, making it possible for an adapter network to
identify labels in the visual modality while being trained in the text
modality. To enhance this cross-modal transfer ability, we propose a simple yet
effective method termed random perturbation, which enables the adapter to
search for potential visual embeddings by perturbing text embeddings with noise
during training, resulting in better performance in the visual modality.
Furthermore, we introduce an effective approach that employs large language
models for multi-label instruction-following text generation. In this way, a
fully automated pipeline for visual label recognition is developed without
relying on any manually annotated data. Extensive experiments on public
benchmarks show the superiority of our method in various multi-label
classification tasks.
Related papers
- DualCoOp++: Fast and Effective Adaptation to Multi-Label Recognition
with Limited Annotations [79.433122872973]
Multi-label image recognition in the low-label regime is a challenging task of
great practical significance.
We leverage the powerful alignment between textual and visual features
pretrained with millions of auxiliary image-text pairs.
We introduce an efficient and effective framework called Evidence-guided Dual
Context Optimization (DualCoOp++).
arXiv Detail & Related papers (2023-08-03T17:33:20Z)
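For context, the original DualCoOp that DualCoOp++ extends scores each label with a pair of learnable positive and negative prompts: a label is predicted present when the image agrees more with its positive prompt. A hedged sketch of that scoring rule follows; the tensor names and temperature are assumptions, not the paper's code.

```python
import torch

def dual_prompt_scores(image_emb, pos_text_emb, neg_text_emb, tau=0.01):
    """Per-label presence probability from paired prompts (sketch).
    image_emb:    (D,)   L2-normalized image embedding.
    pos_text_emb: (L, D) embeddings of learned positive prompts per label.
    neg_text_emb: (L, D) embeddings of learned negative prompts per label.
    """
    pos = pos_text_emb @ image_emb / tau   # (L,) similarity to "present"
    neg = neg_text_emb @ image_emb / tau   # (L,) similarity to "absent"
    # Softmax over the {positive, negative} pair gives P(label present).
    return torch.softmax(torch.stack([pos, neg], dim=-1), dim=-1)[..., 0]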
- LPN: Language-guided Prototypical Network for few-shot classification [16.37959398470535]
Few-shot classification aims to adapt to new tasks with limited labeled examples.
Recent methods explore suitable measures for the similarity between the query and support images.
We propose a Language-guided Prototypical Network (LPN) for few-shot classification.
arXiv Detail & Related papers (2023-07-04T06:54:01Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for
Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment
by using learnable text prompts to resolve an auxiliary dense pixel-wise
prediction task.
Experiments show that our method achieves state-of-the-art performance for
open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model.
To scale up the learning procedure towards detecting a wider spectrum of
objects, we exploit available online resources, iteratively updating the
prompts and later self-training the proposed detector with pseudo labels
generated on a large corpus of noisy, uncurated web images.
arXiv Detail & Related papers (2022-03-30T17:50:21Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
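FILIP's cross-modal late interaction replaces a single global image-text cosine with token-level matching: each image patch is matched to its most similar text token, each text token to its most similar patch, and the maxima are aggregated. A rough sketch under assumed shapes; collapsing the two directions into one averaged score is a simplification, not the paper's exact loss.

```python
import torch

def late_interaction_similarity(img_tokens, txt_tokens):
    """Cross-modal late interaction (FILIP-style sketch).
    img_tokens: (N, D) L2-normalized image patch embeddings.
    txt_tokens: (M, D) L2-normalized text token embeddings.
    """
    sim = img_tokens @ txt_tokens.T        # (N, M) token-pair cosines
    i2t = sim.max(dim=1).values.mean()     # each patch -> best text token
    t2i = sim.max(dim=0).values.mean()     # each text token -> best patch
    return 0.5 * (i2t + t2i)
```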
- CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to better vision-language models
besides prompt tuning.
In this paper, we propose CLIP-Adapter, which conducts fine-tuning with feature
adapters on either the visual or the language branch.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
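The feature adapters referenced in this entry are, in the CLIP-Adapter design, small bottleneck MLPs whose output is residually blended with the frozen backbone feature. A minimal sketch follows; the reduction factor and blend ratio `alpha` are assumed defaults, not prescribed values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Bottleneck MLP with a residual blend (CLIP-Adapter-style sketch)."""
    def __init__(self, dim: int, reduction: int = 4, alpha: float = 0.2):
        super().__init__()
        self.alpha = alpha
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # Blend adapted features with the frozen backbone features, then
        # re-normalize for cosine-similarity classification.
        adapted = self.fc(feat)
        return F.normalize(self.alpha * adapted + (1 - self.alpha) * feat,
                           dim=-1)
```

Keeping the residual weight small preserves the zero-shot knowledge of the frozen model while letting the adapter specialize to the downstream task.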
- Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for
representation learning.
It shifts from the tradition of using images and discrete labels to learn a
fixed set of weights, each representing a visual concept, to aligning images
and raw text with two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z)
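The zero-shot transfer mentioned in this entry amounts to comparing an image embedding against text embeddings of prompted class names in the shared space. A minimal CLIP-style sketch; the prompt template and temperature are assumptions.

```python
import torch

@torch.no_grad()
def zero_shot_classify(image_emb, class_text_embs, tau=0.01):
    """CLIP-style zero-shot classification (sketch).
    image_emb:       (D,)   L2-normalized image embedding.
    class_text_embs: (C, D) L2-normalized embeddings of prompts such as
                     "a photo of a {class}".
    """
    logits = class_text_embs @ image_emb / tau  # (C,) scaled cosines
    return torch.softmax(logits, dim=0)         # class probabilities
```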
This list is automatically generated from the titles and abstracts of the papers on this site.