Learning Visual Representations via Language-Guided Sampling
- URL: http://arxiv.org/abs/2302.12248v2
- Date: Wed, 29 Mar 2023 10:23:40 GMT
- Title: Learning Visual Representations via Language-Guided Sampling
- Authors: Mohamed El Banani, Karan Desai, Justin Johnson
- Abstract summary: We use language similarity to sample semantically similar image pairs for contrastive learning.
Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity.
We show that language-guided learning yields better features than image-based and image-text representation learning approaches.
- Score: 25.117909306792324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although an object may appear in numerous contexts, we often describe it in a
limited number of ways. Language allows us to abstract away visual variation to
represent and communicate concepts. Building on this intuition, we propose an
alternative approach to visual representation learning: using language
similarity to sample semantically similar image pairs for contrastive learning.
Our approach diverges from image-based contrastive learning by sampling view
pairs using language similarity instead of hand-crafted augmentations or
learned clusters. Our approach also differs from image-text contrastive
learning by relying on pre-trained language models to guide the learning rather
than directly minimizing a cross-modal loss. Through a series of experiments,
we show that language-guided learning yields better features than image-based
and image-text representation learning approaches.
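A minimal sketch of this sampling idea, assuming an image-caption dataset, a frozen pre-trained sentence encoder for the captions, and an InfoNCE objective on image features only (the nearest-neighbor search, batching, and temperature below are illustrative assumptions, not the authors' released implementation):

# Language-guided pair sampling, sketched with PyTorch.
import torch
import torch.nn.functional as F

def nearest_neighbor_pairs(caption_emb: torch.Tensor) -> torch.Tensor:
    """For each caption embedding, return the index of its most similar other caption."""
    z = F.normalize(caption_emb, dim=-1)
    sim = z @ z.T                              # cosine similarity between captions
    sim.fill_diagonal_(-float("inf"))          # exclude self-matches
    return sim.argmax(dim=-1)                  # nearest caption per row

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Contrastive loss treating (z_a[i], z_b[i]) as the positive pair within a batch."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / tau
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

def training_step(image_encoder, images, caption_emb):
    """images: a batch of images; caption_emb: frozen sentence-encoder embeddings of their captions."""
    pair_idx = nearest_neighbor_pairs(caption_emb)   # language-guided view sampling
    z_a = image_encoder(images)                      # view 1: the image itself
    z_b = image_encoder(images[pair_idx])            # view 2: its language-matched image
    return info_nce(z_a, z_b)

In practice the caption nearest-neighbor search would be precomputed over the whole corpus rather than within a batch, and the contrastive objective could be swapped for other self-supervised losses; the key point is that the positive pair is chosen by caption similarity rather than by augmenting a single image.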
Related papers
- Compositional Entailment Learning for Hyperbolic Vision-Language Models [54.41927525264365]
We show how to fully leverage the innate hierarchical nature of hyperbolic embeddings by looking beyond individual image-text pairs.
We propose Compositional Entailment Learning for hyperbolic vision-language models.
Empirical evaluation on a hyperbolic vision-language model trained with millions of image-text pairs shows that the proposed compositional learning approach outperforms conventional Euclidean CLIP learning.
arXiv Detail & Related papers (2024-10-09T14:12:50Z)
- TExplain: Explaining Learned Visual Features via Pre-trained (Frozen) Language Models [14.019349267520541]
We propose a novel method that leverages the capabilities of language models to interpret the learned features of pre-trained image classifiers.
Our approach generates a vast number of sentences to explain the features learned by the classifier for a given image.
Our method, for the first time, uses the most frequent words in these generated sentences to provide insights into the classifier's decision-making process.
arXiv Detail & Related papers (2023-09-01T20:59:46Z)
- Cross-Modal Concept Learning and Inference for Vision-Language Models [31.463771883036607]
In existing fine-tuning methods, the class-specific text description is matched against the whole image.
We develop a new method called cross-modal concept learning and inference (CCLI).
Our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts.
arXiv Detail & Related papers (2023-07-28T10:26:28Z)
- Style-Aware Contrastive Learning for Multi-Style Image Captioning [36.1319565907582]
We present a style-aware visual encoder with contrastive learning to mine potential visual content relevant to style.
We also propose a style-aware triplet contrast objective to distinguish whether the image, style, and caption are matched.
Experimental results demonstrate that our approach achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-01-26T19:21:39Z)
- Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods to employ visual information as assistant signals to general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over the existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better than vision-only models at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery.
We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly under-perform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)