From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
- URL: http://arxiv.org/abs/2506.05274v1
- Date: Thu, 05 Jun 2025 17:31:17 GMT
- Title: From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos
- Authors: Animesh Gupta, Jay Parmar, Ishan Rajendrakumar Dave, Mubarak Shah
- Abstract summary: Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving.
- Score: 48.666667545084835
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Composed Video Retrieval (CoVR) retrieves a target video given a query video and a modification text describing the intended change. Existing CoVR benchmarks emphasize appearance shifts or coarse event changes and therefore do not test the ability to capture subtle, fast-paced temporal differences. We introduce TF-CoVR, the first large-scale benchmark dedicated to temporally fine-grained CoVR. TF-CoVR focuses on gymnastics and diving and provides 180K triplets drawn from FineGym and FineDiving. Previous CoVR benchmarks that focus on the temporal aspect link each query to a single target segment taken from the same video, limiting practical usefulness. In TF-CoVR, we instead construct each <query, modification> pair by prompting an LLM with the label differences between clips drawn from different videos; every pair is thus associated with multiple valid target videos (3.9 on average), reflecting real-world tasks such as sports-highlight generation. To model these temporal dynamics, we propose TF-CoVR-Base, a concise two-stage training framework: (i) pre-train a video encoder on fine-grained action classification to obtain temporally discriminative embeddings; (ii) align the composed query with candidate videos using contrastive learning. We conduct the first comprehensive study of image, video, and general multimodal embedding (GME) models on temporally fine-grained composed retrieval in both zero-shot and fine-tuning regimes. On TF-CoVR, TF-CoVR-Base improves zero-shot mAP@50 from 5.92 (LanguageBind) to 7.51, and after fine-tuning raises the state-of-the-art from 19.83 to 25.82.
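The abstract describes TF-CoVR-Base's alignment stage and the mAP@50 metric only at a high level. Below is a minimal sketch of both, assuming an additive fusion of the query-video and modification-text embeddings, a standard InfoNCE-style contrastive loss, and a min(R, k) average-precision denominator; these specifics are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(query_vid_emb, mod_text_emb, target_vid_emb, temperature=0.07):
    """Stage (ii) sketch: fuse the query-video and modification-text embeddings,
    then align the composed query with its target video via an InfoNCE-style
    loss over the batch. The additive fusion and the temperature value are
    assumptions for illustration, not the authors' exact design."""
    composed = F.normalize(query_vid_emb + mod_text_emb, dim=-1)   # (B, D)
    targets = F.normalize(target_vid_emb, dim=-1)                  # (B, D)
    logits = composed @ targets.t() / temperature                  # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    # symmetric objective: composed -> target and target -> composed
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


def map_at_k(similarities, relevant_mask, k=50):
    """mAP@k when a query can have several valid targets (TF-CoVR averages
    ~3.9 target videos per composed query). similarities: (Q, N) scores,
    relevant_mask: (Q, N) boolean. The min(R, k) denominator is one common
    convention; the benchmark's exact protocol may differ."""
    topk = similarities.argsort(dim=-1, descending=True)[:, :k]            # (Q, k)
    hits = relevant_mask.gather(1, topk).float()                           # (Q, k)
    ranks = torch.arange(1, hits.size(1) + 1, device=hits.device).float()
    precision_at_i = hits.cumsum(dim=-1) / ranks
    denom = relevant_mask.sum(dim=-1).clamp(min=1).clamp(max=k).float()
    ap = (precision_at_i * hits).sum(dim=-1) / denom
    return ap.mean().item()
```

With embeddings produced by the stage (i) encoder pre-trained on fine-grained action classification, an evaluation routine of this shape would yield the zero-shot and fine-tuned mAP@50 figures quoted above.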
Related papers
- DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification [4.973664680272982]
DejaVid is an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. We introduce a new neural network architecture inspired by traditional time series alignment algorithms for this learning task. Our evaluation demonstrates that DejaVid substantially improves the performance of a state-of-the-art large encoder.
arXiv Detail & Related papers (2025-06-14T17:39:03Z) - EgoCVR: An Egocentric Benchmark for Fine-Grained Composed Video Retrieval [52.375143786641196]
EgoCVR is an evaluation benchmark for fine-grained Composed Video Retrieval.
EgoCVR consists of 2,295 queries that specifically focus on high-quality temporal video understanding.
arXiv Detail & Related papers (2024-07-23T17:19:23Z) - ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning [29.620990627792906]
This paper presents a new self-supervised video representation learning framework, ARVideo, which autoregressively predicts the next video token in a tailored sequence order.
Extensive experiments establish ARVideo as an effective paradigm for self-supervised video representation learning.
arXiv Detail & Related papers (2024-05-24T02:29:03Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Partially Relevant Video Retrieval [39.747235541498135]
We propose a novel text-to-video retrieval (T2VR) subtask termed Partially Relevant Video Retrieval (PRVR).
PRVR aims to retrieve partially relevant videos from a large collection of untrimmed videos.
We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames (see the sketch after this list).
arXiv Detail & Related papers (2022-08-26T09:07:16Z) - Perceptual Learned Video Compression with Recurrent Conditional GAN [158.0726042755]
We propose a Perceptual Learned Video Compression (PLVC) approach with recurrent conditional generative adversarial network.
PLVC learns to compress video towards good perceptual quality at low bit-rate.
The user study further validates the outstanding perceptual performance of PLVC in comparison with the latest learned video compression approaches.
arXiv Detail & Related papers (2021-09-07T13:36:57Z) - Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z) - Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
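The Partially Relevant Video Retrieval entry above views a video as both a bag of clips and a bag of frames. A minimal sketch of that multiple-instance view follows, assuming cosine similarity, max-pooling within each bag, and a fixed weighted combination of the two bag scores; the pooling and weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def mil_query_video_similarity(text_emb, clip_embs, frame_embs, alpha=0.5):
    """Multiple-instance view of partially relevant video retrieval:
    a video is a bag of clip embeddings and a bag of frame embeddings,
    and the query only needs to match some instance in each bag.
    text_emb: (D,); clip_embs: (C, D); frame_embs: (T, D).
    Max-pooling and the alpha-weighted combination are assumptions."""
    q = F.normalize(text_emb, dim=-1)
    clip_sim = (F.normalize(clip_embs, dim=-1) @ q).max()    # best-matching clip
    frame_sim = (F.normalize(frame_embs, dim=-1) @ q).max()  # best-matching frame
    return alpha * clip_sim + (1 - alpha) * frame_sim
```

Under this view, a long untrimmed video scores highly as soon as one of its clips or frames matches the text query, which is what allows partially relevant videos to be ranked without trimming them first.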