Caption supervision enables robust learners
- URL: http://arxiv.org/abs/2210.07396v1
- Date: Thu, 13 Oct 2022 22:29:10 GMT
- Title: Caption supervision enables robust learners
- Authors: Benjamin Feuer, Ameya Joshi, Chinmay Hegde
- Abstract summary: We show that CNNs trained with a standard cross-entropy loss can also benefit from caption supervision, in some cases even more than VL models, on the same data.
To facilitate future experiments with high-accuracy caption-supervised models, we introduce CaptionNet.
In a series of experiments on CaptionNet, we show how the choice of loss function, data filtration and supervision strategy enable robust computer vision.
- Score: 24.936204628969623
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision language models like CLIP are robust to natural distribution shifts,
in part because CLIP learns on unstructured data using a technique called
caption supervision; the model interprets image-linked texts as ground-truth
labels. In a carefully controlled comparison study, we show that CNNs trained
with a standard cross-entropy loss can also benefit from caption supervision, in
some cases even more than VL models, on the same data. To facilitate future
experiments with high-accuracy caption-supervised models, we introduce
CaptionNet (https://github.com/penfever/CaptionNet/), which includes a
class-balanced, fully supervised dataset with over 50,000 new human-labeled,
ImageNet-compliant samples paired with web-scraped captions. In a series of
experiments on CaptionNet, we show how the choice of loss function, data
filtration and supervision strategy enable robust computer vision. We also
provide the codebase necessary to reproduce our experiments at
https://github.com/penfever/vlhub/
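As a rough illustration of the caption-supervision recipe described in the abstract, the sketch below converts image-caption pairs into hard ImageNet-style labels by matching captions against a class-synonym table, then trains a CNN with standard cross-entropy. The synonym table, matching rule, and training loop are illustrative assumptions, not the exact CaptionNet pipeline.

```python
# Minimal sketch: caption supervision driving a cross-entropy CNN.
# The synonym table, matching rule, and training loop are illustrative
# assumptions, not the paper's exact CaptionNet pipeline.
from typing import List, Optional

import torch
import torch.nn as nn
import torchvision

# Hypothetical class-synonym table: class index -> caption keywords.
CLASS_SYNONYMS = {0: ["tabby", "tabby cat"], 1: ["golden retriever"], 2: ["airliner", "jetliner"]}

def caption_to_label(caption: str) -> Optional[int]:
    """Assign a hard label only if exactly one class's synonyms appear in the caption."""
    caption = caption.lower()
    hits = [c for c, syns in CLASS_SYNONYMS.items() if any(s in caption for s in syns)]
    return hits[0] if len(hits) == 1 else None  # drop ambiguous or unmatched captions

model = torchvision.models.resnet50(num_classes=len(CLASS_SYNONYMS))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def train_step(images: torch.Tensor, captions: List[str]) -> float:
    """One optimization step on caption-derived hard labels."""
    labels = [caption_to_label(c) for c in captions]
    keep = [i for i, lab in enumerate(labels) if lab is not None]
    if not keep:
        return 0.0
    x = images[keep]
    y = torch.tensor([labels[i] for i in keep])
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is only that caption supervision can reduce to ordinary hard-label training once captions are mapped to classes; the paper's experiments vary the loss function, data filtration, and supervision strategy on top of such a setup.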
Related papers
- Impact of Language Guidance: A Reproducibility Study [0.0]
Recent advances in self-supervised learning allow us to train huge models without explicit annotation.
We use an off-the-shelf image captioning model, BLIP-2, to replace the captions and improve performance.
We also devise a new metric to evaluate the semantic capabilities of self-supervised models based on interpretability methods.
arXiv Detail & Related papers (2025-04-10T21:59:13Z)
- RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning [69.23782518456932]
We propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA)
We bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2.
To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE.
arXiv Detail & Related papers (2024-05-11T16:22:00Z)
- Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts [22.74552390076515]
We probe the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones and pretraining sets.
We detect the presence of outlier features in robust zero-shot CLIP vision encoders, which to the best of our knowledge is the first time these are observed in non-transformer models.
We find the existence of outlier features to be an indication of ImageNet shift robustness in models, since we only find them in robust models in our analysis.
arXiv Detail & Related papers (2023-10-19T17:59:12Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning [88.5382122413913]
We study whether language supervision can result in vision models with more transferable representations than traditional image-only methods.
We find that image-only methods do not match CLIP's transfer performance, even when they are trained with more image data.
Motivated by our findings, we devise simple prescriptions to enable CLIP to better leverage the language information present in existing pre-training datasets.
arXiv Detail & Related papers (2022-07-15T17:50:51Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present a novel prompt-based scheme for training the UIC model that makes the best use of the powerful generalization ability of VL-PTMs.
arXiv Detail & Related papers (2022-05-26T03:13:43Z)
- Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge numbers of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function (see the sketch after this entry).
We also propose a simple finetuning strategy of the CLIP text encoder to improve grammar that does not require extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
arXiv Detail & Related papers (2022-05-26T02:46:09Z)
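To make the CLIP-reward idea above concrete, here is a minimal REINFORCE-style sketch in which the reward for a sampled caption is its CLIP image-text cosine similarity. The encoder and captioner interfaces (`clip_image_encoder`, `clip_text_encoder`, `captioner.sample`) are hypothetical stand-ins; the paper's actual reward shaping, baseline, and grammar finetuning are not reproduced here.

```python
# Minimal sketch: CLIP image-text similarity as a caption reward.
# `clip_image_encoder`, `clip_text_encoder`, and `captioner.sample` are
# hypothetical stand-ins for a frozen CLIP model and a trainable captioner.
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_reward(image_feats: torch.Tensor, caption_feats: torch.Tensor) -> torch.Tensor:
    """Reward = cosine similarity between CLIP image and caption embeddings."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(caption_feats, dim=-1)
    return (img * txt).sum(dim=-1)  # shape: (batch,)

def reinforce_step(images, captioner, clip_image_encoder, clip_text_encoder, optimizer):
    """One policy-gradient step that pushes the captioner toward higher CLIP reward."""
    captions, log_probs = captioner.sample(images)           # assumed interface
    reward = clip_reward(clip_image_encoder(images),
                         clip_text_encoder(captions))
    baseline = reward.mean()                                  # simple variance-reduction baseline
    loss = -((reward - baseline) * log_probs.sum(dim=-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()
```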
- Identifying and Compensating for Feature Deviation in Imbalanced Deep Learning [59.65752299209042]
We investigate learning a ConvNet under class-imbalanced data.
We find that a ConvNet significantly over-fits the minority classes.
We propose to incorporate class-dependent temperatures (CDT) when training the ConvNet (see the sketch after this entry).
arXiv Detail & Related papers (2020-01-06T03:52:11Z)
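Below is a minimal sketch of the class-dependent temperature (CDT) idea from the entry above, under the assumption that each class's temperature grows with its rarity and divides that class's logit at training time only; the exact schedule and exponent used in the paper may differ.

```python
# Minimal sketch: class-dependent temperatures (CDT) for imbalanced training.
# Assumption: temperature a_c = (N_max / N_c) ** gamma scales down the logits
# of rarer classes at training time only; the paper's exact form may differ.
import torch
import torch.nn.functional as F

def cdt_temperatures(class_counts: torch.Tensor, gamma: float = 0.3) -> torch.Tensor:
    """Rarer classes get larger temperatures, shrinking their training-time logits."""
    return (class_counts.max() / class_counts.float()) ** gamma

def cdt_loss(logits: torch.Tensor, targets: torch.Tensor, temps: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on temperature-scaled logits (training only; evaluate on raw logits)."""
    return F.cross_entropy(logits / temps, targets)

# Usage with a hypothetical 3-class imbalanced training set.
counts = torch.tensor([5000, 500, 50])
temps = cdt_temperatures(counts)           # roughly [1.0, 2.0, 4.0] for gamma = 0.3
logits = torch.randn(8, 3)
targets = torch.randint(0, 3, (8,))
loss = cdt_loss(logits, targets, temps)
```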