Prefix Conditioning Unifies Language and Label Supervision
- URL: http://arxiv.org/abs/2206.01125v2
- Date: Mon, 15 May 2023 18:42:57 GMT
- Title: Prefix Conditioning Unifies Language and Label Supervision
- Authors: Kuniaki Saito, Kihyuk Sohn, Xiang Zhang, Chun-Liang Li, Chen-Yu Lee,
Kate Saenko, Tomas Pfister
- Abstract summary: We show that dataset biases negatively affect pre-training by reducing the generalizability of learned representations.
In experiments, we show that this simple technique improves zero-shot image recognition accuracy and robustness to image-level distribution shift.
- Score: 84.11127588805138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-classification datasets have been used to pretrain image
recognition models. Recently, web-scale image-caption datasets have emerged as
a powerful alternative source of pretraining data. Image-caption datasets are
more "open-domain", containing a wider variety of scene types and vocabulary
words than traditional classification datasets, and models trained on them
have demonstrated strong performance on few- and zero-shot recognition tasks.
However, when image-classification and image-caption datasets are naively
unified, we show that the resulting dataset bias negatively affects
pre-training: it reduces the generalizability of the learned representations
and thus jeopardizes zero-shot performance, because the unification tailors
the model to the classification dataset and makes it vulnerable to
distribution shift away from that dataset. In this work, we address the
problem by disentangling the dataset bias with prefix tokens that inform the
language encoder of the type of the input dataset (e.g., image-classification
or image-caption) at training time. This approach allows the language encoder
to share knowledge across the two datasets while also switching its mode of
feature extraction between a classification-tailored mode and a
caption-tailored mode; we use the image-caption mode in zero-shot evaluation.
Our method is generic and can be easily integrated into existing VL
pre-training objectives such as CLIP or UniCL. In experiments, we show that
this simple technique improves zero-shot image recognition accuracy and
robustness to image-level distribution shift.
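Below is a minimal sketch (not the authors' released code) of how prefix conditioning could be attached to a CLIP-style contrastive objective: a learnable prefix embedding per dataset type is prepended to the text-token embeddings before the language encoder, and the usual symmetric InfoNCE loss is applied to the resulting features. The module names, dimensions, dataset-type ids, and helper functions below are illustrative assumptions.

```python
# Minimal sketch of prefix conditioning on top of a CLIP-style contrastive
# objective. Module names, dimensions, dataset-type ids, and helpers are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

CLS_PREFIX_ID, CAPTION_PREFIX_ID = 0, 1   # assumed ids for the two dataset types

class PrefixConditionedTextEncoder(nn.Module):
    def __init__(self, text_encoder, embed_dim, num_dataset_types=2):
        super().__init__()
        # `text_encoder` is any transformer that maps a sequence of token
        # embeddings (B, L, D) to a pooled text feature (B, D).
        self.text_encoder = text_encoder
        # One learnable prefix embedding per dataset type.
        self.prefix = nn.Embedding(num_dataset_types, embed_dim)

    def forward(self, token_embeds, dataset_type):
        # token_embeds: (B, L, D) embedded text tokens
        # dataset_type: (B,) long tensor of dataset-type ids
        prefix = self.prefix(dataset_type).unsqueeze(1)            # (B, 1, D)
        return self.text_encoder(torch.cat([prefix, token_embeds], dim=1))

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    # Symmetric InfoNCE over the in-batch image-text similarity matrix, as in
    # CLIP; prefix conditioning leaves this objective unchanged.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

At zero-shot evaluation time, class names would be wrapped in prompts and encoded with the image-caption prefix, so the language encoder runs in its caption-tailored mode. A sketch, again assuming a hypothetical `embed_tokens` tokenizer-plus-embedding helper:

```python
def zero_shot_logits(image_feats, class_names, embed_tokens, text_model):
    # Class names are placed in prompts and encoded with the image-caption
    # prefix at test time, matching the paper's zero-shot protocol.
    prompts = [f"a photo of a {name}" for name in class_names]
    token_embeds = embed_tokens(prompts)                  # (C, L, D), hypothetical helper
    dataset_type = torch.full((len(prompts),), CAPTION_PREFIX_ID, dtype=torch.long)
    text_feats = F.normalize(text_model(token_embeds, dataset_type), dim=-1)
    return F.normalize(image_feats, dim=-1) @ text_feats.t()   # (B, C) similarity logits
```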
Related papers
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z) - ALIP: Adaptive Language-Image Pre-training with Synthetic Caption [78.93535202851278]
Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks.
The presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning.
We propose an Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic caption.
arXiv Detail & Related papers (2023-08-16T15:19:52Z) - Diversify Your Vision Datasets with Automatic Diffusion-Based Augmentation [66.6546668043249]
ALIA (Automated Language-guided Image Augmentation) is a method which utilizes large vision and language models to automatically generate natural language descriptions of a dataset's domains.
To maintain data integrity, a model trained on the original dataset filters out minimal image edits and those which corrupt class-relevant information.
We show that ALIA is able to surpass traditional data augmentation and text-to-image generated data on fine-grained classification tasks.
arXiv Detail & Related papers (2023-05-25T17:43:05Z) - Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We present extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired setting.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust and user-intentional cropping algorithms.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z) - Zero-Shot Text-to-Image Generation [15.135825501365007]
We describe a transformer that autoregressively models the text and image tokens as a single stream of data.
With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
arXiv Detail & Related papers (2021-02-24T06:42:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and accepts no responsibility for any consequences of its use.