A CLIP-Hitchhiker's Guide to Long Video Retrieval
- URL: http://arxiv.org/abs/2205.08508v1
- Date: Tue, 17 May 2022 17:26:23 GMT
- Title: A CLIP-Hitchhiker's Guide to Long Video Retrieval
- Authors: Max Bain, Arsha Nagrani, Gül Varol, Andrew Zisserman
- Abstract summary: We study the adaptation of image-text models for long video retrieval.
Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP.
We find that the simple yet effective baseline of a weighted mean of frame embeddings via query-scoring is a significant improvement over prior temporal modelling attempts and mean-pooling.
- Score: 84.36155238161462
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Our goal in this paper is the adaptation of image-text models for long video
retrieval. Recent works have demonstrated state-of-the-art performance in video
retrieval by adopting CLIP, effectively hitchhiking on the image-text
representation for video tasks. However, there has been limited success in
learning temporal aggregation that outperforms mean-pooling the image-level
representations extracted per frame by CLIP. We find that the simple yet
effective baseline of weighted-mean of frame embeddings via query-scoring is a
significant improvement over all prior temporal modelling attempts and
mean-pooling. In doing so, we provide an improved baseline for others to
compare to and demonstrate state-of-the-art performance of this simple baseline
on a suite of long video retrieval benchmarks.
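The abstract does not spell out the exact aggregation formula, but a minimal sketch of query-scored weighted-mean pooling over per-frame CLIP embeddings might look as follows (the function name, softmax temperature, and embedding dimension are illustrative assumptions, not the authors' released code):
```python
import torch
import torch.nn.functional as F

def query_scored_video_embedding(frame_embeds, text_embed, temperature=0.07):
    """Aggregate per-frame CLIP embeddings into a single video embedding,
    weighting each frame by its similarity to the text query.

    frame_embeds: (num_frames, dim) image-level CLIP embeddings, one per frame.
    text_embed:   (dim,) CLIP embedding of the text query.
    Returns a (dim,) video embedding.
    """
    frame_embeds = F.normalize(frame_embeds, dim=-1)
    text_embed = F.normalize(text_embed, dim=-1)

    # Per-frame relevance scores: cosine similarity between query and frame.
    scores = frame_embeds @ text_embed                    # (num_frames,)

    # Softmax over frames turns the scores into aggregation weights;
    # the temperature (an assumed value) controls how peaked the weighting is.
    weights = torch.softmax(scores / temperature, dim=0)  # (num_frames,)

    # Weighted mean of frame embeddings, re-normalised for retrieval.
    video_embed = (weights.unsqueeze(-1) * frame_embeds).sum(dim=0)
    return F.normalize(video_embed, dim=-1)

# Example: score 32 frames of a long video against one text query.
frames = torch.randn(32, 512)
query = torch.randn(512)
video_vec = query_scored_video_embedding(frames, query)
retrieval_score = video_vec @ F.normalize(query, dim=0)
```
With uniform weights over frames (e.g., as the temperature grows large) this reduces to plain mean-pooling, which is the baseline the paper improves upon.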
Related papers
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to images labeled with text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training (a rough sketch follows after this list).
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Representation Recycling for Streaming Video Analysis [19.068248496174903]
StreamDEQ aims to infer frame-wise representations on videos with minimal per-frame computation.
We show that StreamDEQ is able to recover near-optimal representations in a few frames' time and maintain an up-to-date representation throughout the video duration.
arXiv Detail & Related papers (2022-04-28T13:35:14Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- Fine-Grained Instance-Level Sketch-Based Video Retrieval [159.12935292432743]
We propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR).
Compared with sketch-based still image retrieval, and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be simultaneously matched at a fine-grained level.
We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
arXiv Detail & Related papers (2020-02-21T18:28:35Z)
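Regarding the image-captioning protocol in the first related paper above ("Learning text-to-video retrieval from image captioning"), a rough, heavily simplified sketch of turning unlabeled videos into retrieval training pairs might look like this; `pseudo_label_videos`, `caption_image`, and the uniform sampling rate are hypothetical placeholders rather than that paper's actual pipeline:
```python
from typing import Any, Callable, List, Tuple

def pseudo_label_videos(
    videos: List[List[Any]],              # each video as a list of decoded frames
    caption_image: Callable[[Any], str],  # placeholder for an off-the-shelf image captioner
    frames_per_video: int = 4,
) -> List[Tuple[int, str]]:
    """Build (video_index, pseudo-caption) training pairs from unlabeled videos.

    Frames are sampled uniformly and captioned with an image captioner; the
    captions then serve as text supervision for training a text-to-video
    retrieval model without any human video labels.
    """
    pairs: List[Tuple[int, str]] = []
    for vid_idx, frames in enumerate(videos):
        step = max(len(frames) // frames_per_video, 1)
        for frame in frames[::step][:frames_per_video]:
            pairs.append((vid_idx, caption_image(frame)))
    return pairs

# Example usage with a dummy captioner standing in for a real model.
dummy_videos = [["frame_a", "frame_b", "frame_c", "frame_d"]]
pairs = pseudo_label_videos(dummy_videos, caption_image=lambda f: f"a photo of {f}")
```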