Fugu-MT paper translation (abstract): Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

Paper summary: Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation

arxiv url: http://arxiv.org/abs/2603.23491v1
Date: Tue, 24 Mar 2026 17:57:35 GMT
Status: translation complete
Last updated in system: 2026-03-25 19:53:37.629298
Title: Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation
Title (reference translation): Foveated Diffusion: Efficient Spatially Adaptive Image and Video Generation
Authors: Brian Chao, Lior Yariv, Howard Xiao, Gordon Wetzstein
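Abstract summary: Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. This work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example by eye tracking. In these settings, it leverages the eccentricity-dependent acuity of human vision: a user perceives very high-resolution visual information only in a small region around their gaze location. The approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions.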
Reference score (independently computed attention metric): 38.48147404244147
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. The growing demand for higher resolutions, frame rates, and context lengths, however, makes efficient generation increasingly challenging, as computational complexity grows quadratically with the number of generated tokens. Our work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example, by using eye tracking. In these settings, we leverage the eccentricity-dependent acuity of human vision: while a user perceives very high-resolution visual information in a small region around their gaze location (the foveal region), the ability to resolve detail quickly degrades in the periphery of the visual field. Our approach starts with a mask modeling the foveated resolution to allocate tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results perceptually indistinguishable from full-resolution generation, while drastically reducing the token count and generation time. To this end, we develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, allowing a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. We validate our approach through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.
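Abstract (reference translation): Diffusion and flow matching models have unlocked unprecedented capabilities for creative content creation, such as interactive image and streaming video generation. However, growing demand for higher resolutions, frame rates, and context lengths makes efficient generation increasingly difficult, because computational complexity grows quadratically with the number of generated tokens. This work seeks to optimize the efficiency of the generation process in settings where the user's gaze location is known or can be estimated, for example by eye tracking. In these settings, it exploits the eccentricity-dependent acuity of human vision: a user perceives very high-resolution visual information in a small region around the gaze location (the foveal region), while the ability to resolve detail degrades quickly in the periphery of the visual field. The approach starts with a mask that models the foveated resolution and allocates tokens non-uniformly, assigning higher token density to foveal regions and lower density to peripheral regions. An image or video is generated in a mixed-resolution token setting, yielding results that are perceptually indistinguishable from full-resolution generation while drastically reducing the token count and generation time. To this end, the authors develop a principled mechanism for constructing mixed-resolution tokens directly from high-resolution data, which allows a foveated diffusion model to be post-trained from an existing base model while maintaining content consistency across resolutions. The approach is validated through extensive analysis and a carefully designed user study, demonstrating the efficacy of foveation as a practical and scalable axis for efficient generation.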
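
The token-allocation idea in the abstract can be illustrated with a small sketch. This is not the authors' method: the two-level layout, the circular foveal region, and the names `foveated_token_layout`, `fovea_radius`, and `block` are all assumptions made for illustration. The sketch only shows how a gaze-centred mask might reduce the token count, and how that reduction translates into attention cost if cost is taken to scale quadratically with the number of tokens, as the abstract states.

```python
import math


def foveated_token_layout(grid_h, grid_w, gaze, fovea_radius, block=2):
    """Return a list of (row, col, size) tokens for a two-level layout.

    Hypothetical scheme (not taken from the paper): the fine token grid is
    tiled into block x block cells. A cell whose centre lies within
    fovea_radius (measured in fine-token units) of the gaze point keeps all
    of its fine tokens; any other cell collapses into one coarse token.
    """
    if grid_h % block or grid_w % block:
        raise ValueError("grid dimensions must be divisible by block")

    tokens = []
    for r0 in range(0, grid_h, block):
        for c0 in range(0, grid_w, block):
            centre = (r0 + block / 2, c0 + block / 2)
            eccentricity = math.dist(centre, gaze)
            if eccentricity <= fovea_radius:
                # Foveal cell: keep every fine token (size 1).
                tokens.extend(
                    (r, c, 1)
                    for r in range(r0, r0 + block)
                    for c in range(c0, c0 + block)
                )
            else:
                # Peripheral cell: a single coarse token covers the cell.
                tokens.append((r0, c0, block))
    return tokens


def attention_cost_ratio(n_foveated, n_full):
    """Relative self-attention cost, assuming cost grows as tokens squared."""
    return (n_foveated / n_full) ** 2


if __name__ == "__main__":
    full = 64 * 64
    layout = foveated_token_layout(64, 64, gaze=(32, 32), fovea_radius=12)
    print("tokens:", len(layout), "of", full)
    print("relative attention cost:", attention_cost_ratio(len(layout), full))
```

Under the quadratic-cost assumption, halving the token count cuts attention cost to roughly a quarter, which is why shrinking the peripheral token budget can pay off disproportionately. A real system would presumably use a smoother, multi-level falloff driven by a model of visual acuity rather than a hard two-level cut.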

Related papers