Meta-Personalizing Vision-Language Models to Find Named Instances in
Video
- URL: http://arxiv.org/abs/2306.10169v1
- Date: Fri, 16 Jun 2023 20:12:11 GMT
- Authors: Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron,
Simon Jenni
- Abstract summary: Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications.
They currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears.
We present a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video.
- Score: 30.63415402318075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language models (VLM) have shown impressive results for
language-guided search applications. While these models allow category-level
queries, they currently struggle with personalized searches for moments in a
video where a specific object instance such as "My dog Biscuit" appears. We
present the following three contributions to address this problem. First, we
describe a method to meta-personalize a pre-trained VLM, i.e., learning how to
learn to personalize a VLM at test time to search in video. Our method extends
the VLM's token vocabulary by learning novel word embeddings specific to each
instance. To capture only instance-specific features, we represent each
instance embedding as a combination of shared and learned global category
features. Second, we propose to learn such personalization without explicit
human supervision. Our approach automatically identifies moments of named
visual instances in video using transcripts and vision-language similarity in
the VLM's embedding space. Finally, we introduce This-Is-My, a personal video
instance retrieval benchmark. We evaluate our approach on This-Is-My and
DeepFashion2 and show that we obtain a 15% relative improvement over the state
of the art on the latter dataset.
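The abstract's idea of representing each instance embedding as a combination of shared category features, then using it to search video, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the plain weighted sum, and the toy vectors are hypothetical and are not taken from the paper. In particular, the sketch compares the instance embedding directly with frame embeddings, whereas the described method adds the learned embedding to the VLM's token vocabulary and would pass it through the text encoder as part of a query.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def instance_embedding(category_features, weights):
    """Build one instance embedding as a weighted sum of shared features.

    category_features: (K, D) array shared by all instances (assumed layout).
    weights: (K,) array learned per instance (assumed parameterization).
    """
    return weights @ category_features


def rank_frames(query_embedding, frame_embeddings):
    """Rank video frames by cosine similarity to the query, best first."""
    q = l2_normalize(query_embedding)
    f = l2_normalize(frame_embeddings)
    scores = f @ q
    order = np.argsort(-scores)
    return order, scores


# Toy data: three shared category features in a 3-dimensional space.
shared = np.eye(3)
# Hypothetical per-instance weights selecting the second shared feature.
biscuit = instance_embedding(shared, np.array([0.0, 1.0, 0.0]))
# Toy frame embeddings standing in for VLM image features.
frames = np.array([[1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [0.0, 1.0, 1.0]])
order, scores = rank_frames(biscuit, frames)
```

In this toy setup only the per-instance weights would be learned at personalization time, while the shared features stay fixed across instances; that split is one plausible reading of the abstract, not a confirmed detail of the method.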
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Open Vocabulary Multi-Label Video Classification [45.722133656740446]
We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task.
We propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes.
arXiv Detail & Related papers (2024-07-12T07:53:54Z)
- Déjà Vu Memorization in Vision-Language Models [39.51189095703773]
We propose a new method for measuring memorization in Vision-Language Models (VLMs).
We show that the model indeed retains information about individual objects in the training images beyond what can be inferred from correlations or the image caption.
We evaluate déjà vu memorization at both sample and population level, and show that it is significant for OpenCLIP trained on as many as 50M image-caption pairs.
arXiv Detail & Related papers (2024-02-03T09:55:35Z)
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework that combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Video Moment Localization using Object Evidence and Reverse Captioning [1.1549572298362785]
We address the problem of language-based temporal localization of moments in untrimmed videos.
The current state-of-the-art model, MAC, addresses it by mining activity concepts from both video and language modalities.
We propose the "Multi-faceted VideoMoment Localizer" (MML), an extension of the MAC model that introduces visual object evidence.
arXiv Detail & Related papers (2020-06-18T03:45:49Z)