Fugu-MT Paper Translation (Abstract): How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

Paper summary: How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation

arxiv url: http://arxiv.org/abs/2603.07540v1
Date: Sun, 08 Mar 2026 09:01:20 GMT
Status: Translation complete
System update date: 2026-03-10 15:13:14.792064
Title: How Long Can Unified Multimodal Models Generate Images Reliably? Taming Long-Horizon Interleaved Image Generation via Context Curation
Title (reference translation): How Reliable Is Image Generation by Unified Multimodal Models? The Case of Long-Horizon Interleaved Image Generation via Context Curation
Authors: Haoyu Chen, Qing Liu, Yuqian Zhou, He Zhang, Zhaowen Wang, Mengwei Ren, Jingjing Ren, Xiang Wang, Zhe Lin, Lei Zhu
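Abstract summary: We argue that accumulated visual history acts as a source of active pollution, governed specifically by the number of image events rather than the raw token count. We propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall.
Reference score (independently computed interest score): 42.432491845154445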
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unified multimodal models hold the promise of generating extensive, interleaved narratives, weaving text and imagery into coherent long-form stories. However, current systems suffer from a critical reliability gap: as sequences grow, generation quality rapidly collapses. In this work, we investigate the mechanism behind this failure and argue that it is distinct from standard long-context challenges. We reveal that in generation, accumulated visual history acts as a source of active pollution, a decay governed specifically by the number of image events rather than raw token count. We identify a structural vulnerability where dense visual tokens overwhelm the attention mechanism, creating noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. Extensive experiments demonstrate that this active forgetting approach is essential for stability: UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency, while simultaneously reducing memory footprint and inference time.
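Abstract (reference translation): Unified multimodal models promise to generate extensive, interleaved narratives, weaving text and images into coherent long-form stories. However, current systems suffer from a serious reliability gap: as sequences grow, generation quality rapidly degrades. In this work, we examine the mechanism behind this failure and argue that it differs from standard long-context challenges. We find that, during generation, accumulated visual history acts as a source of active pollution, governed specifically by the number of image events rather than the raw token count. We identify a structural vulnerability in which dense visual tokens overwhelm the attention mechanism and produce noise that distorts future synthesis. Guided by these mechanistic insights, we propose UniLongGen, a training-free inference strategy that prioritizes safe conditioning over total recall. Instead of retaining all history, UniLongGen dynamically curates the model's memory, identifying and discarding interfering visual signals based on the model's own internal relevance rankings. UniLongGen significantly outperforms baselines in long-horizon fidelity and consistency while also reducing memory footprint and inference time.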
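The curation idea in the abstract (keep only the past images the model itself ranks as relevant, and drop the rest) can be illustrated with a minimal sketch. The abstract does not say how relevance is computed or how many images are kept, so everything here is an assumption: the function name `curate_visual_context`, the per-image `relevance` scores, and the `keep` budget are all hypothetical and are not taken from UniLongGen itself.

```python
from typing import Dict, List


def curate_visual_context(relevance: Dict[int, float], keep: int) -> List[int]:
    """Return the ids of the `keep` highest-scoring past images.

    `relevance` maps an image index (its position in the history) to a
    hypothetical relevance score, e.g. some summary of how strongly the
    model attends to that image. The result is in chronological order.
    """
    if keep < 0:
        raise ValueError("keep must be non-negative")
    # Rank image ids from most to least relevant.
    ranked = sorted(relevance, key=lambda idx: relevance[idx], reverse=True)
    # Keep the top-`keep` ids and restore their original ordering.
    return sorted(ranked[:keep])


# Made-up scores for five previously generated images (illustration only).
scores = {0: 0.05, 1: 0.40, 2: 0.10, 3: 0.30, 4: 0.15}
kept = curate_visual_context(scores, keep=2)
print(kept)
```

In this toy setup, the images not in `kept` would be excluded from the conditioning context for the next generation step, which is one simple way to bound the number of image events the model sees.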
