Fugu-MT Paper Translation (Summary): Keep The Essentials: Efficient Reference Conditioned Generation via Token Dropping

Paper summary: Keep The Essentials: Efficient Reference Conditioned Generation via Token Dropping

arxiv url: http://arxiv.org/abs/2606.23682v1
Date: Mon, 22 Jun 2026 17:59:28 GMT
Status: Translation complete
Last updated in system: 2026-06-24 17:09:58.318879
Title: Keep The Essentials: Efficient Reference Conditioned Generation via Token Dropping
Authors: Rishubh Parihar, Ayush Raina, R. Venkatesh Babu, Or Patashnik,
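Abstract summary: This paper proposes Sparse Context, a method that constructs sparse reference representations by retaining only a reduced subset of reference tokens. Even without modifying the model, dropping a significant portion of reference tokens at inference time largely preserves its generation capabilities. The method achieves a 4x increase in inference speed for multi-reference generation and a 2x increase for single-reference generation.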
Reference score (independently computed attention metric): 37.53072128034311
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reference-based diffusion models enable highly controllable image generation by leveraging elements from input images to guide prompt-driven synthesis. However, these models are computationally expensive at runtime, and their cost scales severely with the number of input references. While the efficiency of diffusion models has been extensively studied in the context of prompt-driven generation, it remains largely under-explored in the realm of reference-based models. This setting presents unique challenges not addressed by methods focusing solely on generation. In particular, the wasteful representation of references as dense token grids offers significant opportunities for improvement. In this work, we present Sparse Context, a method for constructing sparse reference representations by retaining only a reduced subset of reference tokens. We observe that even without modifying the model, dropping a significant portion of reference tokens at inference time largely preserves its generation capabilities. To fully realize this potential, we fine-tune the model with random token dropping at varying ratios, encouraging robustness to partial reference representations. Crucially, this training strategy decouples the model from any specific token selection rule, allowing flexible control at inference time. At inference time, instead of random dropping, we apply task-aware token selection strategies that prioritize the most informative regions of the reference images, adapting the token budget to the input and task requirements. Extensive experiments show our method achieves a 4x increase in inference speed for multi-reference generation and a 2x increase for single-reference generation. Importantly, this efficiency is achieved without compromising visual quality across both spatially-aligned editing and subject-driven generation.
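The two token-selection modes mentioned in the abstract (random dropping at varying ratios during fine-tuning, and budgeted selection of the most informative tokens at inference) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `random_token_drop` and `select_top_tokens`, the list-of-vectors token layout, and the squared-norm importance score are all assumptions made for illustration. The abstract does not specify how "informative regions" are scored.

```python
import random

def random_token_drop(ref_tokens, keep_ratio, rng):
    """Keep a uniformly random subset of reference tokens.

    ref_tokens: list of token vectors (hypothetical layout).
    keep_ratio: fraction of tokens to retain, in (0, 1].
    Returns the kept tokens and their original indices in ascending order.
    """
    num_tokens = len(ref_tokens)
    num_keep = max(1, round(num_tokens * keep_ratio))
    # Sample without replacement, then sort so the original order is preserved.
    keep_idx = sorted(rng.sample(range(num_tokens), num_keep))
    return [ref_tokens[i] for i in keep_idx], keep_idx

def select_top_tokens(ref_tokens, scores, budget):
    """Keep the `budget` highest-scoring tokens.

    scores: one importance value per token (an assumed, task-specific input).
    Returns the kept tokens and their original indices in ascending order.
    """
    budget = max(1, min(budget, len(ref_tokens)))
    ranked = sorted(range(len(ref_tokens)), key=lambda i: scores[i], reverse=True)
    keep_idx = sorted(ranked[:budget])
    return [ref_tokens[i] for i in keep_idx], keep_idx

rng = random.Random(0)
# 64 reference tokens of dimension 4, standing in for a dense 8x8 token grid.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(64)]

# Training-style random dropping: retain a quarter of the tokens.
kept, idx = random_token_drop(tokens, keep_ratio=0.25, rng=rng)

# Inference-style selection with a placeholder score (squared vector norm).
scores = [sum(x * x for x in t) for t in tokens]
top, top_idx = select_top_tokens(tokens, scores, budget=16)
```

Returning the kept indices alongside the tokens is a design choice of this sketch: a model that attaches positional information to reference tokens would need to know which grid positions survived.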

Related papers