A Straightforward Framework For Video Retrieval Using CLIP
- URL: http://arxiv.org/abs/2102.12443v2
- Date: Fri, 26 Feb 2021 17:55:07 GMT
- Title: A Straightforward Framework For Video Retrieval Using CLIP
- Authors: Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Retrieval is a challenging task where a text query is matched to a
video or vice versa. Most of the existing approaches for addressing such a
problem rely on annotations made by the users. Although simple, this approach
is not always feasible in practice. In this work, we explore the application of
the language-image model, CLIP, to obtain video representations without the
need for said annotations. This model was explicitly trained to learn a common
space where images and text can be compared. Using various techniques described
in this document, we extended its application to videos, obtaining
state-of-the-art results on the MSR-VTT and MSVD benchmarks.
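The pipeline the abstract describes can be sketched as follows: encode sampled frames with CLIP's image encoder, aggregate them into a single video embedding, and rank videos by cosine similarity against the query's text embedding. This is a minimal illustration, not the paper's exact method: random vectors stand in for real CLIP encoder outputs (512-d matches ViT-B/32), and mean pooling is assumed as the frame-aggregation choice, one of several techniques the paper explores.

```python
import numpy as np

def mean_pool_frames(frame_embeddings: np.ndarray) -> np.ndarray:
    """Average per-frame image embeddings and L2-normalize the result."""
    video = frame_embeddings.mean(axis=0)
    return video / np.linalg.norm(video)

def rank_videos(text_embedding: np.ndarray, video_embeddings: np.ndarray) -> np.ndarray:
    """Return video indices sorted by cosine similarity to the text query."""
    text = text_embedding / np.linalg.norm(text_embedding)
    sims = video_embeddings @ text  # video rows are already unit-norm
    return np.argsort(-sims)

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 3 videos, 8 sampled frames each, 512-d embeddings
# as CLIP ViT-B/32 would produce for frames and for the text query.
videos = np.stack([mean_pool_frames(rng.normal(size=(8, 512))) for _ in range(3)])
query = rng.normal(size=512)  # would come from CLIP's text encoder
ranking = rank_videos(query, videos)
```

Because both modalities live in CLIP's shared space, retrieval reduces to nearest-neighbor search over the pooled video vectors; the same similarity works in either direction (text-to-video or video-to-text).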
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- Composed Video Retrieval via Enriched Context and Discriminative Embeddings [118.66322242183249]
Composed video retrieval (CoVR) is a challenging problem in computer vision.
We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information.
Our approach achieves gains of up to around 7% in recall@1.
arXiv Detail & Related papers (2024-03-25T17:59:03Z)
- In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval [72.98185525653504]
We propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos.
To improve generalization, we show that one model can be trained with multiple text styles.
We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework.
arXiv Detail & Related papers (2023-09-16T08:48:21Z)
- Multi-event Video-Text Retrieval [33.470499262092105]
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet.
We introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events.
We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task.
arXiv Detail & Related papers (2023-08-22T16:32:46Z)
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space to measure semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)
- Transcript to Video: Efficient Clip Sequencing from Texts [65.87890762420922]
We present Transcript-to-Video -- a weakly-supervised framework that uses texts as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to learn visual-language representations and model shot sequencing styles.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
arXiv Detail & Related papers (2021-07-25T17:24:50Z)
- Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as supervision for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.