Fugu-MT paper translation (abstract): LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion

Paper abstract: LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion

arxiv url: http://arxiv.org/abs/2603.14526v1
Date: Sun, 15 Mar 2026 18:07:29 GMT
Status: translation complete
Last updated in system: 2026-03-17 16:19:35.868373
Title: LatSearch: Latent Reward-Guided Search for Faster Inference-Time Scaling in Video Diffusion
Authors: Zengqun Zhao, Ziquan Liu, Yu Cao, Shaogang Gong, Zhensong Zhang, Jifei Song, Jiankang Deng, Ioannis Patras
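Abstract summary: This paper proposes a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning. LatSearch consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.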
Reference score (independently computed attention score): 87.42285185305813
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The recent success of inference-time scaling in large language models has inspired similar explorations in video diffusion. In particular, motivated by the existence of "golden noise" that enhances video quality, prior work has attempted to improve inference by optimising or searching for better initial noise. However, these approaches have notable limitations: they either rely on priors imposed at the beginning of noise sampling or on rewards evaluated only on the denoised and decoded videos. This leads to error accumulation, delayed and sparse reward signals, and prohibitive computational cost, which prevents the use of stronger search algorithms. Crucially, stronger search algorithms are precisely what could unlock substantial gains in controllability, sample efficiency and generation quality for video diffusion, provided their computational cost can be reduced. To fill in this gap, we enable efficient inference-time scaling for video diffusion through latent reward guidance, which provides intermediate, informative and efficient feedback along the denoising trajectory. We introduce a latent reward model that scores partially denoised latents at arbitrary timesteps with respect to visual quality, motion quality, and text alignment. Building on this model, we propose LatSearch, a novel inference-time search mechanism that performs Reward-Guided Resampling and Pruning (RGRP). In the resampling stage, candidates are sampled according to reward-normalised probabilities to reduce over-reliance on the reward model. In the pruning stage, applied at the final scheduled step, only the candidate with the highest cumulative reward is retained, improving both quality and efficiency. We evaluate LatSearch on the VBench-2.0 benchmark and demonstrate that it consistently improves video generation across multiple evaluation dimensions compared to the baseline Wan2.1 model.
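The resampling and pruning stages described in the abstract can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: `denoise_step` and `latent_reward` are hypothetical stand-ins for the video diffusion model and the latent reward model, the softmax normalisation with a `temperature` parameter is an assumption (the abstract only says "reward-normalised probabilities"), and letting resampled candidates inherit their parent's cumulative reward is likewise an assumption.

```python
import numpy as np


def reward_probs(rewards, temperature=1.0):
    """Map raw rewards to sampling probabilities (softmax; an assumption)."""
    r = np.asarray(rewards, dtype=float)
    w = np.exp((r - r.max()) / temperature)  # subtract max for numerical stability
    return w / w.sum()


def rgrp_search(latents, denoise_step, latent_reward, num_steps,
                search_steps, rng, temperature=1.0):
    """Toy reward-guided resampling and pruning over candidate latents.

    latents:       array of shape (n_candidates, ...)
    denoise_step:  hypothetical callable (latents, t) -> latents
    latent_reward: hypothetical callable (latents, t) -> rewards, shape (n_candidates,)
    search_steps:  sorted step indices at which candidates are scored;
                   the last one triggers pruning.
    """
    cumulative = np.zeros(len(latents))
    for t in range(num_steps):
        latents = denoise_step(latents, t)
        if t not in search_steps:
            continue
        rewards = latent_reward(latents, t)
        cumulative = cumulative + rewards
        if t == search_steps[-1]:
            # Pruning: keep only the candidate with the highest cumulative reward.
            best = int(np.argmax(cumulative))
            latents = latents[best:best + 1]
            cumulative = cumulative[best:best + 1]
        else:
            # Resampling: draw candidates stochastically in proportion to
            # normalised reward rather than greedily keeping the top scorer.
            probs = reward_probs(rewards, temperature)
            idx = rng.choice(len(latents), size=len(latents), p=probs)
            latents, cumulative = latents[idx], cumulative[idx]
    return latents[0], float(cumulative[0])
```

With toy stand-ins, for example `denoise_step=lambda x, t: 0.9 * x` and `latent_reward=lambda x, t: -np.abs(x).mean(axis=1)`, the function returns a single surviving latent and its accumulated score. Sampling instead of taking the arg-max at intermediate steps is what the abstract credits with reducing over-reliance on the reward model.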

Related papers