Fugu-MT Paper Translation (Abstract): RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

Paper overview: RefDecoder: Enhancing Visual Generation with Conditional Video Decoding

arxiv url: http://arxiv.org/abs/2605.15196v1
Date: Thu, 14 May 2026 17:59:52 GMT
Status: Translation complete
System update date: 2026-05-15 21:45:35.020819
Title: RefDecoder: Enhancing Visual Generation with Conditional Video Decoding
Title (reference translation): RefDecoder: Enhancing Visual Generation through Conditional Video Decoding
Authors: Xiang Fan, Yuheng Wang, Bohan Fang, Zhongzheng Ren, Ranjay Krishna,
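Abstract summary: RefDecoder is a reference-conditioned VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. It achieves up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks, with consistent improvements across several decoder backbones. RefDecoder can be swapped directly into existing video generation systems without additional fine-tuning.
Reference score (attention score computed independently by this site): 34.53947900093251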
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Video generation powers a vast array of downstream applications. However, while the de facto standard, i.e., latent diffusion models, typically employ heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry leads to significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder requires equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We demonstrate consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be directly swapped into existing video generation systems without additional fine-tuning, and we report across-the-board improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes well to a wide range of visual generation tasks such as style transfer and video editing refinement.
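Abstract (reference translation): Video generation powers a large number of downstream applications. However, while the de facto standard, i.e., latent diffusion models, generally use heavily conditioned denoising networks, their decoders often remain unconditional. We observe that this architectural asymmetry causes significant loss of detail and inconsistency relative to the input image. To address this, we argue that the decoder needs equal conditioning to preserve structural integrity. We introduce RefDecoder, a reference-conditioned video VAE decoder that injects a high-fidelity reference image signal directly into the decoding process via reference attention. Specifically, a lightweight image encoder maps the reference frame into detail-rich, high-dimensional tokens, which are co-processed with the denoised video latent tokens at each decoder up-sampling stage. We show consistent improvements across several distinct decoder backbones (e.g., Wan 2.1 and VideoVAE+), achieving up to +2.1 dB PSNR over the unconditional baselines on the Inter4K, WebVid, and Large Motion reconstruction benchmarks. Notably, RefDecoder can be swapped directly into existing video generation systems without additional fine-tuning, and we report improvements in subject consistency, background consistency, and overall quality scores on the VBench I2V benchmark. Beyond I2V, RefDecoder generalizes to a wide range of visual generation tasks such as style transfer and video editing refinement.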
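
The abstract describes reference tokens being co-processed with the denoised video latent tokens at each decoder up-sampling stage. The following is a minimal NumPy sketch of one way such a step could look, not the authors' implementation. It assumes a single attention head, video tokens as queries, keys and values drawn from the concatenation of video and reference tokens, and a residual connection. The function name `reference_attention`, the projection matrices, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v):
    """Hypothetical single-head attention: video tokens attend to
    both the video tokens and the reference-image tokens."""
    # Joint context: (N_video + N_ref, d)
    context = np.concatenate([video_tokens, ref_tokens], axis=0)
    q = video_tokens @ w_q          # queries come from video tokens only
    k = context @ w_k               # keys from video + reference tokens
    v = context @ w_v               # values from video + reference tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Residual connection (an assumption) keeps the video token stream intact.
    return video_tokens + weights @ v, weights


rng = np.random.default_rng(0)
d = 16                                        # token width (assumed)
video_tokens = rng.normal(size=(8, d))        # 8 denoised video latent tokens
ref_tokens = rng.normal(size=(4, d))          # 4 reference-image tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out, weights = reference_attention(video_tokens, ref_tokens, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (8, 16) (8, 12)
```

In this sketch the output keeps the shape of the video tokens, so the step could sit between existing decoder layers without changing their interfaces. The last four columns of `weights` show how much each video token draws on the reference image.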
