Fugu-MT paper translation (abstract): Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

Paper overview: Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models

arxiv url: http://arxiv.org/abs/2605.14530v2
Date: Tue, 19 May 2026 00:58:43 GMT
Status: translation complete
System update date: 2026-05-20 15:03:08.353784
Title: Mitigating Mask Prior Drift and Positional Attention Collapse in Large Diffusion Vision-Language Models
Title (reference translation): Mitigating mask prior drift and positional attention collapse in large diffusion vision-language models
Authors: Sujung Hong, Chanyong Yoon, Seong Jae Hwang
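Abstract summary: LDVLMs suffer from repetitive generation and degraded visual grounding. This work proposes a training-free approach that introduces Mask Prior Suppression and Monotonic RoPE Scaling. The results suggest that these failures can be effectively addressed with a lightweight plug-and-play strategy.
Reference score (independently computed attention measure): 7.964052580720558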
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, enabling parallel decoding for efficient inference and leveraging bidirectional attention for global context. Despite these advances, their behavior under long-form generation remains underexplored. In this work, we show that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identify two underlying causes. First, repetitive generation originates from a mask token prior: since generation tokens are initialized as mask tokens, their hidden representations progressively drift toward a shared prior direction over generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention toward informative visual tokens, degrading visual grounding. Based on these insights, we propose a training-free approach, introducing Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks demonstrate improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. Our results show that these failures can be effectively addressed with a lightweight, plug-and-play strategy that requires no additional training and generalizes across diverse LDVLM architectures.
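Abstract (reference translation): Large diffusion vision-language models (LDVLMs) have recently emerged as a promising alternative to autoregressive models, offering parallel decoding for efficient inference and bidirectional attention for global context. Despite this progress, their behavior in long-form generation remains underexplored. This work shows that existing LDVLMs suffer from repetitive generation and degraded visual grounding, and identifies two causes. First, repetitive generation stems from a mask token prior: because generation tokens are initialized as mask tokens, their hidden representations gradually drift toward a shared prior direction over the generation steps. Second, a fundamental misalignment between the positional attention bias and the iterative unmasking process suppresses attention to informative visual tokens, which degrades visual grounding. Building on these findings, the authors propose a training-free approach that introduces Mask Prior Suppression and Monotonic RoPE Scaling to mitigate mask prior drift and positional attention collapse during decoding. Experiments on general multimodal benchmarks and visual grounding tasks show improvements over baseline LDVLMs, with robust gains on long-form description benchmarks. The results indicate that these failures can be addressed effectively with a lightweight, plug-and-play strategy that needs no additional training and generalizes across diverse LDVLM architectures.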


Related papers