Fugu-MT Paper Translation (Summary): Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

Paper summary: Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding

arxiv url: http://arxiv.org/abs/2603.10863v1
Date: Wed, 11 Mar 2026 15:15:12 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:33.019721
Title: Beyond Sequential Distance: Inter-Modal Distance Invariant Position Encoding
Authors: Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang
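Abstract summary: Multimodal Large Language Models (MLLMs) suffer from visual fading in long-context scenarios. The authors propose inter-modal Distance Invariant Position Encoding (DIPE). DIPE disentangles position encoding based on modality interactions.
Reference score (independently computed attention metric): 37.24524628097006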
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.
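The disentangling idea in the abstract can be illustrated with a toy relative-distance matrix. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact formulation, so the function name `effective_distances`, the modality labels, and the constant `anchor` distance are all hypothetical, and the real method is integrated with Multimodal RoPE (see the linked repository for the actual code).

```python
from typing import List


def effective_distances(modalities: List[str], anchor: int = 1) -> List[List[int]]:
    """Toy relative-distance matrix (rows: query positions, columns: key positions).

    Assumption for illustration only:
      - same-modality pairs keep their sequential offset (i - j);
      - cross-modality pairs use a constant `anchor` distance, so the
        distance does not grow with the amount of text in between.
    Entries with j > i are left at 0 (they would be masked under causal attention).
    """
    n = len(modalities)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only keys at or before the query
            if modalities[i] == modalities[j]:
                dist[i][j] = i - j   # intra-modal: natural relative position
            else:
                dist[i][j] = anchor  # inter-modal: anchored, length-invariant
    return dist


if __name__ == "__main__":
    # One visual token followed by a growing run of text tokens.
    for text_len in (5, 50):
        d = effective_distances(["v"] + ["t"] * text_len)
        last = text_len  # index of the final text token
        print(text_len, d[last][0], d[last][1])
```

In this toy setup the last text token sees the visual token at the same anchored distance whether 5 or 50 text tokens follow it, while its distance to the first text token still grows with the sequence length.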

Related papers