FuguReport

Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval

Authors Jun Li, Xuhang Lou, Jinpeng Wang, Yuting Wang, Yaowei Wang, Shu-Tao Xia, Bin Chen
Affiliations Harbin Institute of Technology / Peng Cheng Laboratory / Tsinghua University
Categories Application / Video Retrieval / Retrieval of untrimmed videos with partial event queries, Method / Representation Learning / Coarse representation learning for video retrieval, Method / Diffusion Models / Diffusion-guided enhancement in retrieval tasks
License CC BY 4.0

Abstract Overview

This paper addresses Partially Relevant Video Retrieval (PRVR), where a text query describes only a segment of an untrimmed video, making retrieval susceptible to spurious local matches. The proposed DreamPRVR framework adopts a coarse-to-fine strategy. It first generates global semantic registers for each video using a text-supervised truncated diffusion process initialized from a video-centric probabilistic distribution. It then fuses those registers with frame- and clip-level tokens via register-augmented Gaussian attention to improve fine-grained cross-modal matching. The method also introduces textual semantic structure learning, which combines a query diversity loss with a query similarity preservation loss so that queries from the same video remain semantically coherent while queries from different videos stay separable. A Textual Perturbation Sampler models query uncertainty to provide supervision targets for register generation.
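To make the fusion step concrete, here is a minimal sketch of what register-augmented Gaussian attention could look like. It is not the paper's implementation. It assumes that each clip token attends over the registers plus all clip tokens, that clip-to-clip scores get a Gaussian locality bias on temporal distance, and that clip-to-register scores get no bias. The function name `gaussian_register_attention` and the parameter `sigma` are hypothetical.

```python
import math

def gaussian_register_attention(clips, registers, sigma=1.0):
    """Attend each clip token over [registers + clips] (illustrative assumption).

    Clip-to-clip logits are penalised by squared temporal distance, which is
    equivalent to multiplying the attention weights by a Gaussian window.
    Clip-to-register logits are left unbiased, so registers act as global context.
    Returns (outputs, attention_weights).
    """
    dim = len(clips[0])
    keys = registers + clips            # registers first, then clips in temporal order
    n_reg = len(registers)
    outputs, attention = [], []
    for i, query in enumerate(clips):
        logits = []
        for j, key in enumerate(keys):
            score = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            if j >= n_reg:              # clip key: apply the Gaussian locality bias
                dist = i - (j - n_reg)
                score -= dist * dist / (2.0 * sigma * sigma)
            logits.append(score)
        peak = max(logits)              # subtract the max for a stable softmax
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        attention.append(weights)
        outputs.append(
            [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
        )
    return outputs, attention
```

With a small `sigma`, each clip mixes mostly with its temporal neighbours and the registers. With a large `sigma`, the locality bias vanishes and the layer behaves like ordinary attention over all tokens.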

Novelty

The key novelty is using diffusion-generated register tokens as explicit global context for partially relevant video retrieval, replacing approaches that rely solely on local clip matching or training-time-only regularization. The method couples this with a structured textual latent space (via query similarity preservation and diversity losses) and a video-centric probabilistic initialization, yielding a lightweight truncated diffusion design that provides global contextual cues during both training and inference.

Results

On ActivityNet Captions, Charades-STA, and TVR, DreamPRVR achieves the highest reported SumR values among all compared methods: 156.1, 80.0, and 193.1, respectively (Table 1). Ablation studies (Table 3) show that removing registers, diffusion refinement, video-centric initialization, or textual semantic structure learning consistently reduces retrieval performance across all three benchmarks. Efficiency analysis on Charades-STA (Table 2) indicates modest additional overhead relative to strong baselines such as HLFormer, with comparable retrieval time.
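The summary does not define SumR. In PRVR work it is conventionally the sum of Recall@K over K = 1, 5, 10, and 100, with each recall given as a percentage. The sketch below assumes that convention. The helper names `recall_at_k` and `sum_r` are illustrative, and the ranks in the usage note are made up for the example rather than taken from the paper.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth video is ranked in the top k.

    `ranks` holds the 1-based rank of the correct video for each query.
    """
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)

def sum_r(ranks, ks=(1, 5, 10, 100)):
    """Sum of Recall@K over the assumed cut-offs K = 1, 5, 10, 100."""
    return sum(recall_at_k(ranks, k) for k in ks)
```

For example, five toy queries with ranks `[1, 3, 7, 50, 200]` give R@1 = 20, R@5 = 40, R@10 = 60, and R@100 = 80, so `sum_r` returns 200.0. Under this convention the maximum possible SumR is 400.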

Key Points

  1. DreamPRVR addresses PRVR with a coarse-to-fine pipeline that first generates holistic video-context registers via a truncated diffusion process, then uses them to enhance local frame- and clip-level cross-modal alignment.
  2. Register generation combines textual semantic structure learning (query diversity and query similarity preservation losses), a probabilistic variational sampler for video-centric initialization, and an iterative diffusion register estimator, with the resulting registers fused into video tokens through register-augmented Gaussian attention.
  3. Experiments on three benchmarks show state-of-the-art SumR scores, and ablations confirm that each component (registers, diffusion refinement, video-centric initialization, and textual structure learning) contributes to improved retrieval, at a modest computational overhead.
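The register-generation step in the points above can be illustrated with a toy sketch. It is not the authors' estimator and rests on several assumptions: the video-centric distribution is a Gaussian centred on the mean-pooled video tokens, refinement uses a few deterministic DDIM-style updates, and `denoiser` is a hypothetical callable that predicts the clean register at each step. The schedule `alphas` (cumulative signal levels, ending at 1.0) is also illustrative.

```python
import math
import random

def mean_pool(tokens):
    """Average the token vectors into one video-level feature."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def sample_initial_register(video_tokens, std, rng):
    """Video-centric initialization (assumed): Gaussian around the pooled feature."""
    mu = mean_pool(video_tokens)
    return [m + std * rng.gauss(0.0, 1.0) for m in mu]

def ddim_step(x_t, x0_hat, alpha_t, alpha_prev):
    """One deterministic DDIM-style update from signal level alpha_t to alpha_prev."""
    eps_hat = [
        (x - math.sqrt(alpha_t) * x0) / math.sqrt(1.0 - alpha_t)
        for x, x0 in zip(x_t, x0_hat)
    ]
    return [
        math.sqrt(alpha_prev) * x0 + math.sqrt(1.0 - alpha_prev) * e
        for x0, e in zip(x0_hat, eps_hat)
    ]

def generate_register(video_tokens, denoiser, alphas, std=0.1, seed=0):
    """Run a short (truncated) refinement chain from the video-centric sample.

    `alphas` must be increasing, below 1.0 except for the final entry.
    """
    rng = random.Random(seed)
    x = sample_initial_register(video_tokens, std, rng)
    for step in range(len(alphas) - 1):
        x0_hat = denoiser(x, step)      # hypothetical clean-register prediction
        x = ddim_step(x, x0_hat, alphas[step], alphas[step + 1])
    return x
```

Because the chain starts from a video-dependent sample rather than pure noise, only a handful of steps are needed, which is the intuition behind calling the process truncated. In this sketch, a final signal level of 1.0 means the output equals the denoiser's last prediction.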
