"This is my unicorn, Fluffy": Personalizing frozen vision-language
representations
- URL: http://arxiv.org/abs/2204.01694v1
- Date: Mon, 4 Apr 2022 17:58:11 GMT
- Title: "This is my unicorn, Fluffy": Personalizing frozen vision-language
representations
- Authors: Niv Cohen, Rinon Gal, Eli A. Meirom, Gal Chechik, Yuval Atzmon
- Abstract summary: We introduce a new learning setup called Personalized Vision & Language (PerVL)
In PerVL, one should learn personalized concepts independently of the downstream task.
We demonstrate that our approach learns personalized visual concepts from a few examples and can effectively apply them in image retrieval and semantic segmentation.
- Score: 31.618829097336047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision & Language models pretrained on web-scale data provide
representations that are invaluable for numerous V&L problems. However, it is
unclear how they can be used for reasoning about user-specific visual concepts
in unstructured language. This problem arises in multiple domains, from
personalized image retrieval to personalized interaction with smart devices. We
introduce a new learning setup called Personalized Vision & Language (PerVL)
with two new benchmark datasets for retrieving and segmenting user-specific
"personalized" concepts "in the wild". In PerVL, one should learn personalized
concepts (1) independently of the downstream task (2) allowing a pretrained
model to reason about them with free language, and (3) does not require
personalized negative examples. We propose an architecture for solving PerVL
that operates by extending the input vocabulary of a pretrained model with new
word embeddings for the new personalized concepts. The model can then reason
about them by simply using them in a sentence. We demonstrate that our approach
learns personalized visual concepts from a few examples and can effectively
apply them in image retrieval and semantic segmentation using rich textual
queries.
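
The core of the proposed architecture, expanding the input vocabulary of a frozen pretrained model with one new word embedding per personalized concept, can be pictured with a minimal PyTorch sketch. The encoder interfaces, prompt handling, and cosine loss below are simplifying assumptions for illustration, not the authors' exact training procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def learn_concept_embedding(text_encoder, image_encoder, embedding_table,
                            token_ids, placeholder_pos, concept_images,
                            steps=200, lr=1e-2):
    """Learn one new word embedding for a personalized concept while the
    pretrained encoders stay frozen. text_encoder is assumed to map a sequence
    of token embeddings to a joint-space vector and image_encoder to map images
    into the same space (hypothetical interfaces, not the paper's exact API)."""
    for module in (text_encoder, image_encoder):
        for p in module.parameters():
            p.requires_grad_(False)

    # The only trainable parameter: one extra "word" for the expanded vocabulary,
    # initialized at the mean of the existing word embeddings.
    concept_vec = nn.Parameter(embedding_table.weight.detach().mean(dim=0).clone())
    optimizer = torch.optim.Adam([concept_vec], lr=lr)

    with torch.no_grad():
        image_feats = F.normalize(image_encoder(concept_images), dim=-1)  # few positives

    ids = torch.tensor(token_ids)  # prompt such as "A photo of <concept>", pre-tokenized
    for _ in range(steps):
        token_embeds = embedding_table(ids).detach()
        # Splice the learnable word into the placeholder position of the prompt.
        seq = torch.cat([token_embeds[:placeholder_pos],
                         concept_vec.unsqueeze(0),
                         token_embeds[placeholder_pos + 1:]], dim=0)
        text_feat = F.normalize(text_encoder(seq.unsqueeze(0)), dim=-1)
        loss = 1.0 - (image_feats @ text_feat.T).mean()  # pull caption toward the examples
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return concept_vec.detach()  # reusable in any later sentence mentioning the concept
```

Once such a vector is learned from a handful of positive images, it can be dropped into free-form queries (e.g. "a photo of <fluffy> sleeping on the couch") and the frozen model handles retrieval or segmentation downstream.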
Related papers
- MyVLM: Personalizing VLMs for User-Specific Queries [78.33252556805931]
We take a first step toward the personalization of vision-language models, enabling them to learn and reason over user-provided concepts.
To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model.
Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM.
This embedding is tasked with guiding the language model to naturally integrate the target concept in its generated response.
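As a rough illustration of the concept-head idea described in this summary, a hypothetical PyTorch module (not MyVLM's actual code) could gate a learned concept embedding behind a small binary recognizer over frozen image features:

```python
import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    """Hypothetical external concept head: recognizes one user-specific concept
    from frozen VLM image features and gates a learned concept embedding."""
    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 1)                      # "is my concept here?"
        self.concept_embedding = nn.Parameter(torch.randn(embed_dim) * 0.02)

    def forward(self, image_features: torch.Tensor, threshold: float = 0.5):
        prob = torch.sigmoid(self.classifier(image_features)).mean()
        if prob < threshold:
            return None                    # concept absent: the VLM answers as usual
        return self.concept_embedding      # concept present: embedding is passed to the LM
```

One such head would be trained per user concept; when it fires, the returned embedding is injected alongside the VLM's intermediate features so the language model can mention the concept in its response.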
arXiv Detail & Related papers (2024-03-21T17:51:01Z)
- User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning [35.211749514733846]
Traditional image captioning methods often overlook the preferences and characteristics of users.
Most existing methods focus on fusing user context via memory networks or transformers.
We propose a novel personalized image captioning framework that leverages user context to consider personality factors.
arXiv Detail & Related papers (2023-12-08T02:08:00Z)
- Designing an Encoder for Fast Personalization of Text-to-Image Models [57.62449900121022]
We propose an encoder-based domain-tuning approach for text-to-image personalization.
We employ two components: First, an encoder that takes as input a single image of a target concept from a given domain.
Second, a set of regularized weight-offsets for the text-to-image model that learn how to effectively ingest additional concepts.
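A minimal sketch of the regularized weight-offset ingredient, assuming a single PyTorch linear layer; the single-image encoder and the paper's actual parameterization are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetLinear(nn.Module):
    """Hypothetical wrapper: a frozen pretrained linear layer plus a trainable
    offset that is kept small by an L2 penalty, so new concepts can be ingested
    without drifting far from the pretrained text-to-image model."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():
            p.requires_grad_(False)
        self.weight_offset = nn.Parameter(torch.zeros_like(pretrained.weight))

    def forward(self, x):
        return F.linear(x, self.pretrained.weight + self.weight_offset,
                        self.pretrained.bias)

    def reg_loss(self):
        return self.weight_offset.pow(2).mean()   # regularizer added to the tuning objective
```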
arXiv Detail & Related papers (2023-02-23T18:46:41Z)
- Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
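The three ingredients named in this summary can be pictured with a simplified PyTorch loss sketch; the shapes, temperature, and exact formulations are assumptions and differ from SgVA-CLIP's actual objectives:

```python
import torch
import torch.nn.functional as F

def sgva_style_losses(adapted_visual, frozen_visual, class_text, labels, tau=0.07):
    """Simplified combination of the three loss ingredients named in the summary
    (cross-modal contrastive, vision-specific contrastive, implicit distillation).
    Shapes: adapted_visual/frozen_visual (B, D), class_text (C, D), labels (B,)."""
    v = F.normalize(adapted_visual, dim=-1)
    t = F.normalize(class_text, dim=-1)
    f = F.normalize(frozen_visual, dim=-1)

    # Cross-modal contrastive: match each adapted image feature to its class text feature.
    loss_xmodal = F.cross_entropy(v @ t.T / tau, labels)

    # Vision-specific contrastive: adapted features of the same class should agree.
    sim = v @ v.T / tau
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(eye, -1e9)                      # exclude self-similarity
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    loss_visual = -(log_prob * same.float()).sum(1) / same.float().sum(1).clamp(min=1)
    loss_visual = loss_visual.mean()

    # Implicit distillation: keep adapted features close to the frozen backbone's.
    loss_distill = (1 - (v * f).sum(-1)).mean()

    return loss_xmodal + loss_visual + loss_distill
```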
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- Unifying Vision-Language Representation Space with Single-tower Transformer [29.604520441315135]
We train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner.
We discover intriguing properties that distinguish OneR from the previous works that learn modality-specific representation spaces.
arXiv Detail & Related papers (2022-11-21T02:34:21Z)
- Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding [58.70423899829642]
We present Pix2Struct, a pretrained image-to-text model for purely visual language understanding.
We show that a single pretrained model can achieve state-of-the-art results in six out of nine tasks across four domains.
arXiv Detail & Related papers (2022-10-07T06:42:06Z)
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
- Rich Semantics Improve Few-shot Learning [49.11659525563236]
We show that by using 'class-level' language descriptions, which can be acquired at minimal annotation cost, we can improve few-shot learning performance.
We develop a Transformer based forward and backward encoding mechanism to relate visual and semantic tokens.
arXiv Detail & Related papers (2021-04-26T16:48:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.