Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval
- URL: http://arxiv.org/abs/2510.12283v1
- Date: Tue, 14 Oct 2025 08:38:20 GMT
- Title: Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval
- Authors: Jianfeng Dong, Lei Huang, Daizong Liu, Xianke Chen, Xun Yang, Changting Lin, Xun Wang, Meng Wang
- Abstract summary: In practice, videos are typically untrimmed, of long duration, and contain much more complicated background content. We propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets.
- Score: 53.54695034420311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Almost all previous text-to-video retrieval works ideally assume that videos are pre-trimmed into short durations containing solely text-related content. In practice, however, videos are typically untrimmed, of long duration, and contain much more complicated background content. Therefore, in this paper, we focus on the more practical yet challenging task of Partially Relevant Video Retrieval (PRVR), which aims to retrieve untrimmed videos that are only partially relevant to a given query. To tackle this task, we propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model and transfers it to a lightweight, task-specific PRVR network. Specifically, we introduce a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD++), where a large teacher model provides supervision to a compact dual-branch student network. The student model comprises two branches: an inheritance branch that absorbs transferable knowledge from the teacher, and an exploration branch that learns task-specific information from the PRVR dataset to address domain gaps. To further enhance learning, we incorporate a dynamic soft-target construction mechanism. By replacing rigid hard-target supervision with adaptive soft targets that evolve during training, our method enables the model to better capture the fine-grained, partial relevance between videos and queries. Experimental results demonstrate that our proposed model achieves state-of-the-art performance on the TVR, ActivityNet, and Charades-STA datasets for PRVR. The code is available at https://github.com/HuiGuanLab/DL-DKD.
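The abstract describes the dual-branch teacher-student setup only at a high level. Below is a minimal, hypothetical PyTorch sketch of that idea, assuming a frozen teacher that supplies query-video similarity scores and a two-branch student trained with a KL-based soft-target distillation loss plus a contrastive task loss. All class names, hyperparameters, and the fixed loss weight are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of dual-branch distillation with soft targets.
# Names and hyperparameters are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Branch(nn.Module):
    """Lightweight projection branch mapping pooled video/query features to a joint space."""
    def __init__(self, in_dim: int = 512, out_dim: int = 256):
        super().__init__()
        self.vid_proj = nn.Linear(in_dim, out_dim)
        self.txt_proj = nn.Linear(in_dim, out_dim)

    def forward(self, vid_feats, txt_feats):
        v = F.normalize(self.vid_proj(vid_feats), dim=-1)
        t = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return t @ v.T  # query-to-video similarity matrix


class DualBranchStudent(nn.Module):
    """Student with an inheritance branch (guided by the teacher) and an
    exploration branch (trained on task-specific supervision)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.inheritance = Branch(dim)
        self.exploration = Branch(dim)

    def forward(self, vid_feats, txt_feats):
        return self.inheritance(vid_feats, txt_feats), self.exploration(vid_feats, txt_feats)


def distillation_step(student, teacher_sim, vid_feats, txt_feats, alpha=0.5, tau=0.07):
    """One training step: the inheritance branch matches the teacher's soft similarity
    distribution (KL divergence), while the exploration branch is trained with a
    standard contrastive loss against hard query-video pairings."""
    sim_inh, sim_exp = student(vid_feats, txt_feats)

    # Soft targets: teacher similarities turned into a distribution over videos per query.
    soft_targets = F.softmax(teacher_sim / tau, dim=-1)
    loss_distill = F.kl_div(F.log_softmax(sim_inh / tau, dim=-1),
                            soft_targets, reduction="batchmean")

    # Hard-target task loss: diagonal pairs are the relevant ones in this toy batch.
    labels = torch.arange(sim_exp.size(0))
    loss_task = F.cross_entropy(sim_exp / tau, labels)

    return alpha * loss_distill + (1 - alpha) * loss_task


if __name__ == "__main__":
    # Toy usage with random features standing in for teacher/backbone outputs.
    B, D = 8, 512
    student = DualBranchStudent(D)
    vid, txt = torch.randn(B, D), torch.randn(B, D)
    teacher_sim = torch.randn(B, B)  # stand-in for CLIP-style teacher similarities
    loss = distillation_step(student, teacher_sim, vid, txt)
    loss.backward()
    print(f"combined loss: {loss.item():.4f}")
```

In the actual DL-DKD++ method the balance between teacher guidance and task-specific learning is adjusted dynamically during training and the soft targets themselves evolve; the fixed `alpha` and static `teacher_sim` above are only placeholders for those mechanisms.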
Related papers
- VideoWorld 2: Learning Transferable Knowledge from Real-world Videos [128.28225837148713]
VideoWorld 2 offers the first investigation into learning transferable knowledge directly from raw real-world videos. VideoWorld 2 learns latent codes that focus on compact and meaningful task-related dynamics. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate.
arXiv Detail & Related papers (2026-02-10T18:58:19Z) - Unsupervised Video Class-Incremental Learning via Deep Embedded Clustering Management [47.53991869205973]
Unsupervised video class-incremental learning (uVCIL) is an important paradigm for learning from video data without forgetting. We propose a simple yet effective approach to address uVCIL. We first consider a deep feature extractor network, providing a set of representative video features during each task without assuming any class or task information.
arXiv Detail & Related papers (2026-01-20T15:25:41Z) - Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering [47.53991869205973]
We propose a realistic scenario for unsupervised video learning in which neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning.
arXiv Detail & Related papers (2025-08-29T16:49:03Z) - Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
We introduce the Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning. We find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
arXiv Detail & Related papers (2025-04-17T17:59:57Z) - VideoWorld: Exploring Knowledge Learning from Unlabeled Videos [119.35107657321902]
This work explores whether a deep generative model can learn complex knowledge solely from visual input. We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
arXiv Detail & Related papers (2025-01-16T18:59:10Z) - Learning from One Continuous Video Stream [70.30084026960819]
We introduce a framework for online learning from a single continuous video stream.
This poses great challenges given the high correlation between consecutive video frames.
We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation.
arXiv Detail & Related papers (2023-12-01T14:03:30Z) - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z) - VLG: General Video Recognition with Web Textual Knowledge [47.3660792813967]
We focus on the general video recognition (GVR) problem of solving different recognition tasks within a unified framework.
By leveraging semantic knowledge from noisy text descriptions crawled from the Internet, we present a unified visual-linguistic framework (VLG).
VLG is first pre-trained on video and language datasets to learn a shared feature space, and then uses a flexible bi-modal attention head to integrate high-level semantic concepts under different settings.
arXiv Detail & Related papers (2022-12-03T15:46:49Z) - Enabling Weakly-Supervised Temporal Action Localization from On-Device Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to learn directly from a long, on-device video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z) - Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet).
MCPNet consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z) - Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space to measure their semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)