Fugu-MT Paper Translation (Abstract): AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors

Paper overview: AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors

arxiv url: http://arxiv.org/abs/2509.23109v1
Date: Sat, 27 Sep 2025 04:37:26 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.046893
Title: AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors
Title (reference translation): AttAnchor: Guiding Cross-Modal Token Alignment in VLMs with Attention Anchors
Authors: Junyang Zhang, Tianyi Zhu, Thierry Tambe
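Abstract summary: This work proposes Attention Anchor, a parameter-free framework that efficiently groups semantically similar tokens across modalities. By inserting text tokens near relevant visual patches, it creates semantic signposts that reveal true content-based cross-modal attention scores. AttAnchor achieves improvements on 13 of 15 metrics and benchmarks.
Reference score (independently computed attention level): 3.9039205692819547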
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A fundamental reason for the dominance of attention over RNNs and LSTMs in LLMs is its ability to capture long-range dependencies by modeling direct interactions between all tokens, overcoming the sequential limitations of recurrent architectures. Similarly, a key reason why today's vision language models (VLMs) hallucinate and underperform pure language models is that they rely on direct concatenation of image and text tokens with a modality-blinded positional encoding, which conveniently adopts the pretrained LLM backbone but forces unnecessary long-distance attention between semantically related tokens across modalities. This underscores the urgent need for mechanisms that efficiently enhance token locality and cross-modal alignment. In response, we propose Attention Anchor, a parameter-free framework that efficiently groups semantically similar tokens across modalities, improving cross-modal locality. By inserting text tokens near relevant visual patches, we create semantic signposts that reveal true content-based cross-modal attention scores, guiding the model to focus on the correct image regions for tasks such as VQA, MMBench and POPE. This improves answer accuracy and reduces hallucinations without disrupting the prompt's semantic flow. AttAnchor achieves improvements across 13 out of 15 different metrics and benchmarks, including up to 32% gains on reasoning tasks and up to 15% improvements on hallucination benchmarks. AttAnchor enables TinyLLaVA 1B to outperform much larger models like LLaVA 7B and QwenVL 3B on POPE with only 0.1% inference time overhead. To the best of our knowledge, this work is among the first to investigate mixed-modal token grouping, where text and image tokens are clustered jointly into shared groups rather than being grouped within a single modality or merely aligned post-hoc with additional alignment losses.
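Abstract (reference translation): A fundamental reason for the dominance of attention over RNNs and LSTMs in LLMs is its ability to capture long-range dependencies by modeling direct interactions between all tokens, overcoming the sequential limitations of recurrent architectures. Similarly, a key reason why today's vision language models (VLMs) hallucinate and underperform is that they rely on direct concatenation of image and text tokens with a modality-blinded positional encoding. This underscores the urgent need for mechanisms that efficiently enhance token locality and cross-modal alignment. In response, we propose Attention Anchor, a parameter-free framework that efficiently groups semantically similar tokens across modalities, improving cross-modal locality. By inserting text tokens near relevant visual patches, we create semantic signposts that reveal true content-based cross-modal attention scores for tasks such as VQA, MMBench and POPE. This improves answer accuracy and reduces hallucinations without disrupting the prompt's semantic flow. AttAnchor achieves improvements across 13 out of 15 metrics and benchmarks. AttAnchor enables TinyLLaVA 1B to outperform larger models such as LLaVA 7B and QwenVL 3B on POPE with only 0.1% inference time overhead. To the best of our knowledge, this work is among the first to investigate mixed-modal token grouping, where text and image tokens are clustered jointly into shared groups rather than being grouped within a single modality or merely aligned post-hoc with additional alignment losses.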
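
Illustrative sketch (not the authors' implementation): the abstract describes inserting text tokens near relevant visual patches so that semantically related tokens sit close together in the sequence. The toy code below shows one way such a mixed-modal ordering could be built. Everything specific here is an assumption made for illustration: the function names `cosine` and `insert_anchors`, the use of cosine similarity on embeddings, the `threshold` parameter, and the choice to copy each text token after its single best-matching patch while leaving the original prompt intact at the end.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def insert_anchors(patch_embs, text_embs, text_tokens, threshold=0.5):
    """Build a mixed sequence of image patches, anchor copies, and the prompt.

    Hypothetical rule (an assumption, not taken from the paper): each text
    token is copied right after its most similar image patch when the
    similarity reaches `threshold`. The original prompt is appended unchanged.
    """
    anchors = {}  # patch index -> text tokens anchored after that patch
    for token, text_emb in zip(text_tokens, text_embs):
        sims = [cosine(patch_emb, text_emb) for patch_emb in patch_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            anchors.setdefault(best, []).append(token)

    sequence = []
    for idx in range(len(patch_embs)):
        sequence.append(("img", idx))
        for token in anchors.get(idx, []):
            sequence.append(("anchor", token))
    # The prompt itself follows the image tokens in its original order.
    sequence.extend(("txt", token) for token in text_tokens)
    return sequence


# Toy 2-D embeddings: patch 0 resembles "cat", patch 1 resembles "sky".
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prompt_tokens = ["cat", "sky"]
prompt_embs = [[0.9, 0.1], [0.0, 1.0]]
mixed = insert_anchors(patches, prompt_embs, prompt_tokens)
```

In this toy run the anchor copy of "cat" lands directly after patch 0 and the copy of "sky" after patch 1, while the prompt tokens keep their original order at the end of the sequence. How the actual method selects patches, handles positional encoding, or limits the number of anchors is not stated in the abstract and is not modeled here.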
