Impact of Language Guidance: A Reproducibility Study
- URL: http://arxiv.org/abs/2504.08140v1
- Date: Thu, 10 Apr 2025 21:59:13 GMT
- Title: Impact of Language Guidance: A Reproducibility Study
- Authors: Cherish Puniani, Advika Sinha, Shree Singhi, Aayan Yadav,
- Abstract summary: Recent advances in self-supervised learning allow us to train huge models without explicit annotation. We use an off-the-shelf image captioning model, BLIP-2, to replace the captions and improve performance. We also devise a new metric to evaluate the semantic capabilities of self-supervised models based on interpretability methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern deep-learning architectures need large amounts of data to produce state-of-the-art results. Annotating such huge datasets is time-consuming, expensive, and prone to human error. Recent advances in self-supervised learning allow us to train huge models without explicit annotation. Contrastive learning is a popular paradigm in self-supervised learning. Recent works like SimCLR and CLIP rely on image augmentations or directly minimizing cross-modal loss between image and text. Banani et al. (2023) propose to use language guidance to sample view pairs. They claim that language enables better conceptual similarity, eliminating the effects of visual variability. We reproduce their experiments to verify their claims and find that their dataset, RedCaps, contains low-quality captions. We use an off-the-shelf image captioning model, BLIP-2, to replace the captions and improve performance, and we also devise a new metric to evaluate the semantic capabilities of self-supervised models based on interpretability methods.
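To make the pipeline in the abstract concrete, the sketch below illustrates its two ingredients: re-captioning images with an off-the-shelf BLIP-2 model and sampling positive view pairs by caption similarity. This is a minimal, hedged sketch, not the authors' released code; the HuggingFace checkpoints "Salesforce/blip2-opt-2.7b" and "all-MiniLM-L6-v2" and the nearest-neighbour pairing rule are illustrative assumptions.

```python
# Hedged sketch (not the authors' code): re-caption images with BLIP-2, then pick
# language-guided positive pairs by caption similarity. Checkpoint names are assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# (1) Replace low-quality captions with BLIP-2 generations.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def recaption(image_paths):
    """Generate a replacement caption for each image path."""
    captions = []
    for path in image_paths:
        image = Image.open(path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt").to(device)
        ids = captioner.generate(**inputs, max_new_tokens=30)
        captions.append(processor.decode(ids[0], skip_special_tokens=True).strip())
    return captions

# (2) Language-guided view pairs: each image's positive is the image whose caption
# embedding is most similar (cosine similarity in a sentence-embedding space).
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def language_guided_pairs(captions):
    """Return, for every caption, the index of its nearest-caption neighbour."""
    emb = text_encoder.encode(captions, convert_to_tensor=True, normalize_embeddings=True)
    sim = emb @ emb.T                      # cosine similarity (rows are unit-norm)
    sim.fill_diagonal_(float("-inf"))      # never pair an image with itself
    return sim.argmax(dim=1).tolist()
```

In practice the caption embeddings would be precomputed once and the nearest-neighbour search over a large corpus handled with an approximate index (e.g. FAISS) rather than a dense similarity matrix.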
Related papers
- A Chain-of-Thought Subspace Meta-Learning for Few-shot Image Captioning with Large Vision and Language Models [17.144311122664508]
A large-scale vision and language model pretrained on massive data encodes visual and linguistic priors.
We propose a chain-of-thought (CoT) meta-learning scheme as a multi-step image captioning procedure to better imitate how humans describe images.
arXiv Detail & Related papers (2025-02-19T18:35:43Z) - Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation [70.95783968368124]
We introduce a novel multi-modal autoregressive model, dubbed InstaManip. We propose an innovative group self-attention mechanism to break down the in-context learning process into two separate stages. Our method surpasses previous few-shot image manipulation models by a notable margin.
arXiv Detail & Related papers (2024-12-02T01:19:21Z) - Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models [16.4010094165575]
We propose a robust phrase grounding model trained with text-conditioned and text-unconditioned data augmentations.
Inspired by recent masked signal reconstruction, we propose to use pixel-level masking as a novel form of data augmentation.
Our method demonstrates improved performance over the state of the art on various metrics.
arXiv Detail & Related papers (2023-11-05T01:14:02Z) - Visual Grounding Helps Learn Word Meanings in Low-Data Regimes [47.7950860342515]
Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension.
But to achieve these results, LMs must be trained in distinctly un-human-like ways.
Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning?
We investigate this question in the context of word learning, a key sub-task in language acquisition.
arXiv Detail & Related papers (2023-10-20T03:33:36Z) - The Role of Data Curation in Image Captioning [26.61662352061468]
This paper contributes to this direction by actively curating difficult samples in datasets without increasing the total number of samples.
Experiments on the Flickr30K and COCO datasets with the BLIP and BEiT-3 models demonstrate that these curation methods do indeed yield improved image captioning models.
arXiv Detail & Related papers (2023-05-05T15:16:07Z) - Prompt-based Learning for Unpaired Image Captioning [86.44188293709307]
Unpaired Image Captioning (UIC) has been developed to learn image descriptions from unaligned vision-language sample pairs.
Recent successes of Vision-Language Pre-Trained Models (VL-PTMs) have triggered the development of prompt-based learning.
We present in this paper a novel prompt-based scheme to train the UIC model, making the best use of their powerful generalization ability.
arXiv Detail & Related papers (2022-05-26T03:13:43Z) - Meta-Learning and Self-Supervised Pretraining for Real World Image Translation [5.469808405577674]
We explore image-to-image translation problem in order to formulate a novel multi-task few-shot image generation benchmark.
We present several baselines for the few-shot problem and discuss trade-offs between different approaches.
arXiv Detail & Related papers (2021-12-22T14:48:22Z) - Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate a descriptive text given an image at inference time.
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z) - VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer [76.3906723777229]
We present VidLanKD, a video-language knowledge distillation method for improving language understanding.
We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.
In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models.
arXiv Detail & Related papers (2021-07-06T15:41:32Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.