Fugu-MT paper translation (abstract): Positional Preservation Embedding for Multimodal Large Language Models

Paper overview: Positional Preservation Embedding for Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2510.22936v1
Date: Mon, 27 Oct 2025 02:40:02 GMT
Status: translation complete
Last updated in system: 2025-10-28 15:28:15.422553
Title: Positional Preservation Embedding for Multimodal Large Language Models
Title (reference translation): Position-Preserving Embedding for Multimodal Large Language Models
Authors: Mouxiao Huang, Borui Jiang, Dehua Zheng, Hailin Hu, Kai Han, Xinghao Chen,
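Abstract summary: Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks but often suffer from inefficiency caused by redundant visual tokens. This work proposes a new encoding method that preserves spatiotemporal structure during token compression. PPE is shown to effectively support cascade clustering, a progressive token compression strategy, and to improve performance retention.
Reference score (independently computed attention measure): 20.307929204794917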
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. Existing token merging methods reduce sequence length but frequently disrupt spatial layouts and temporal continuity by disregarding positional relationships. In this work, we propose a novel encoding operator dubbed Positional Preservation Embedding (PPE), whose main hallmark is the preservation of spatiotemporal structure during visual token compression. PPE explicitly introduces the disentangled encoding of 3D positions in the token dimension, enabling each compressed token to encapsulate different positions from multiple original tokens. Furthermore, we show that PPE can effectively support cascade clustering, a progressive token compression strategy that leads to better performance retention. PPE is a parameter-free and generic operator that can be seamlessly integrated into existing token merging methods without any adjustments. Applied to a state-of-the-art token merging framework, PPE achieves consistent improvements of 2% to 5% across multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding) and VideoMME (temporal understanding). These results demonstrate that preserving positional cues is critical for efficient and effective MLLM reasoning.
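Abstract (reference translation): Multimodal large language models (MLLMs) achieve high performance on vision-language tasks, but often suffer from inefficiency caused by redundant visual tokens. Existing token merging methods shorten the sequence, but frequently break spatial layout and temporal continuity because they ignore positional relationships. This paper proposes a new encoding operator, called Positional Preservation Embedding (PPE), whose defining feature is that it preserves spatiotemporal structure during visual token compression. PPE explicitly introduces a disentangled encoding of 3D positions in the token dimension, so that each compressed token can encapsulate different positions from multiple original tokens. The paper further shows that PPE effectively supports cascade clustering, a progressive token compression strategy, which leads to better performance retention. PPE is a parameter-free, general-purpose operator that can be integrated seamlessly into existing token merging methods without adjustment. Applied to a state-of-the-art token merging framework, PPE yields consistent improvements of 2% to 5% on multiple vision-language benchmarks, including MMBench (general vision understanding), TextVQA (layout understanding), and VideoMME (temporal understanding). These results indicate that preserving positional cues is important for efficient and effective MLLM reasoning.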
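
The abstract describes PPE only at a high level: after merging, one compressed token carries the 3D positions of several original tokens, and this survives repeated (cascade) compression. The toy sketch below illustrates that idea in plain Python. It is not the authors' implementation. The names `merge_stage` and `per_group_positions`, the grouping of consecutive tokens (a stand-in for real similarity-based clustering), and the rule that assigns one member position to each channel group are all assumptions made for illustration.

```python
from typing import List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) index of an original visual token


def merge_stage(clusters: List[List[Position]], group_size: int) -> List[List[Position]]:
    """One compression stage: merge every `group_size` consecutive clusters.

    Consecutive grouping is a placeholder for a real clustering step.
    Each merged cluster keeps the positions of all its original tokens.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    merged: List[List[Position]] = []
    for start in range(0, len(clusters), group_size):
        members: List[Position] = []
        for cluster in clusters[start:start + group_size]:
            members.extend(cluster)
        merged.append(members)
    return merged


def per_group_positions(members: List[Position], num_groups: int) -> List[Position]:
    """Assumed scheme: channel group g of the merged token is encoded with the
    position of one cluster member, picked evenly across the cluster."""
    if not members or num_groups <= 0:
        raise ValueError("need at least one member and one channel group")
    n = len(members)
    return [members[(g * n) // num_groups] for g in range(num_groups)]


# Four tokens from one frame (t=0) laid out on a 2x2 grid.
tokens = [[(0, 0, 0)], [(0, 0, 1)], [(0, 1, 0)], [(0, 1, 1)]]
stage1 = merge_stage(tokens, 2)  # 4 tokens -> 2 tokens
stage2 = merge_stage(stage1, 2)  # 2 tokens -> 1 token
positions = per_group_positions(stage2[0], 4)
```

In this sketch the single token left after two stages still exposes all four original grid positions, one per channel group, whereas a merge that kept only one position per token would retain a single (t, h, w) triple.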

Related papers