Fugu-MT Paper Translation (Abstract): VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

Paper summary: VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation

arxiv url: http://arxiv.org/abs/2605.30317v1
Date: Thu, 28 May 2026 17:55:06 GMT
Status: Translation complete
System update date: 2026-05-30 02:45:56.653349
Title: VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation
Title (reference translation): VPG: Visual Prefix Guidance for Autoregressive Image and Video Generation
Authors: Xinyao Liao, Qiyuan He, Yicong Li, Jiayin Zhu, Xiaoye Qu, Wei Wei, Angela Yao
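Abstract summary: Visual Prefix Guidance (VPG) is a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix.
Reference score (independently computed attention measure): 61.564370002191744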
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autoregressive image and video generators are trained with teacher-forced histories but must sample from their own generated prefixes at inference time, making them vulnerable to exposure bias and prefix drift. Existing remedies either modify training or apply sampling-time guidance aimed primarily at external semantic conditions, such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. We propose Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix, then extrapolating logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.
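Abstract (reference translation): Autoregressive image and video generators are trained on teacher-forced histories, but at inference time they must sample from prefixes they generated themselves, which leaves them vulnerable to exposure bias and prefix drift. Existing remedies either change training or apply sampling-time guidance aimed mainly at external semantic conditions such as class labels or text prompts, rather than testing whether a next-step prediction provides strong posterior support for the generated prefix itself. This paper proposes Visual Prefix Guidance (VPG), a training-free inference-time guidance method for autoregressive image and video generation. VPG improves next-step prediction by contrasting the model's output under the generated prefix with its output under a corrupted prefix and extrapolating the logits toward candidates that strengthen the posterior support of the generated prefix. Across class-conditional image generation with VAR, text-to-image generation with Infinity, and text-to-video generation with InfinityStar, VPG improves generation quality without retraining the base model, reducing FID on VAR by 0.36 on average and improving benchmark performance on both image and video generation.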
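
The abstract describes contrasting logits under the generated prefix with logits under a corrupted prefix and then extrapolating. A minimal sketch of that idea follows. It assumes a classifier-free-guidance-style linear rule, which the abstract does not state. The function name `prefix_guided_logits`, the `scale` parameter, and the toy logit values are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_guided_logits(logits_clean, logits_corrupt, scale):
    """Push logits away from the corrupted-prefix prediction.

    ASSUMPTION: a CFG-style linear extrapolation,
        guided = clean + scale * (clean - corrupt).
    The paper's actual rule may differ; this only illustrates the contrast.
    """
    return [c + scale * (c - k) for c, k in zip(logits_clean, logits_corrupt)]


# Toy next-token logits (hypothetical values, vocabulary of 3 tokens).
logits_clean = [2.0, 1.0, 0.0]    # model output given the generated prefix
logits_corrupt = [1.0, 1.0, 1.0]  # model output given a corrupted prefix

guided = prefix_guided_logits(logits_clean, logits_corrupt, scale=1.0)
probs = softmax(guided)
```

With `scale=0.0` the rule returns the unguided logits. Larger values put more weight on tokens whose score depends on the intact prefix.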

Related papers