Fugu-MT Paper Translation (Abstract): GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention

Paper overview: GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention

arxiv url: http://arxiv.org/abs/2606.06249v1
Date: Thu, 04 Jun 2026 14:52:12 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.874251
Title: GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention
Authors: Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello
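Abstract summary: Volumetric Multimodal cross-Attention (VMA) is a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by the query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity.
Reference score (attention metric computed independently by this site): 15.387737375519286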
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwise dot-product interactions or by concatenating all the modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity, enabling native modeling of any-order modality interactions. We integrate VMA into our novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.
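The abstract describes attention scores built from the volume spanned by a query and several modality-specific keys, but it does not give the formula. The following is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the volume is the square root of the Gram determinant of the L2-normalized vectors, that key position j is aligned across modalities, and that a smaller volume (more closely aligned vectors) should receive a larger weight through a softmax over negated volumes. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np


def gram_volume(vectors):
    """Volume of the parallelotope spanned by the rows of `vectors` (k x d).

    Computed as sqrt(det(V V^T)); the determinant is clipped at zero to
    absorb small negative values caused by floating-point error.
    """
    gram = vectors @ vectors.T
    return float(np.sqrt(max(np.linalg.det(gram), 0.0)))


def volume_attention_weights(query, keys_per_modality, temperature=1.0):
    """Attention weights over n key positions for a single query.

    query:             array of shape (d,)
    keys_per_modality: list of arrays, each of shape (n, d), one per modality
                       (assumption: position j is aligned across modalities)
    Returns an array of shape (n,) that sums to 1.
    """
    def unit(x):
        # L2-normalize along the feature axis so the volume lies in [0, 1].
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q = unit(query)
    keys = [unit(k) for k in keys_per_modality]
    n = keys[0].shape[0]

    # One volume per key position: the query stacked with that position's
    # key from every modality.
    volumes = np.array([
        gram_volume(np.stack([q] + [k[j] for k in keys]))
        for j in range(n)
    ])

    # Assumption: smaller volume means stronger joint alignment, so the
    # score is the negated volume, followed by a numerically stable softmax.
    scores = -volumes / temperature
    scores = scores - scores.max()
    weights = np.exp(scores)
    return weights / weights.sum()


# Usage: two modalities, two key positions, 3-dimensional features.
query = np.array([1.0, 0.0, 0.0])
keys_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
keys_b = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
weights = volume_attention_weights(query, [keys_a, keys_b])
```

In this toy input, position 0 holds keys parallel to the query (zero volume) and position 1 holds keys orthogonal to it (unit volume), so position 0 receives the larger weight. Note that each score involves one determinant over the query and all modality keys jointly, rather than a sum of separate pairwise dot products.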

Related papers