Fugu-MT Paper Translation (Abstract): Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval

Paper Summary: Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval

arxiv url: http://arxiv.org/abs/2604.03653v1
Date: Sat, 04 Apr 2026 09:05:14 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:18.701876
Title: Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
Title (reference translation): Diffusion-Guided Registers Enable Partially Relevant Video Retrieval
Authors: Jun Li, Xuhang Lou, Jinpeng Wang, Yuting Wang, Yaowei Wang, Shu-Tao Xia, Bin Chen
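Abstract summary: Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. This paper proposes DreamPRVR, which adopts a coarse-to-fine representation learning paradigm.
Reference score (independently computed attention metric): 74.31577742865488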
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/CVPR26-DreamPRVR.
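Abstract (reference translation): Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception and struggle with query ambiguity and with local noise caused by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video, and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Specifically, the registers are initialized from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined by a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-structured textual latent space, improving the reliability of global perception. The registers are adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is available at https://github.com/lijun2005/CVPR26-DreamPRVR.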


Related papers