Fugu-MT Paper Translation (Abstract): Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

Paper overview: Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

arxiv url: http://arxiv.org/abs/2509.15448v1
Date: Thu, 18 Sep 2025 21:44:07 GMT
Status: Translation complete
Last updated in system: 2025-09-22 18:18:10.914734
Title: Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems
Authors: Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida
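Abstract summary: We first propose a mathematical construct to represent multi-modal, multi-scale data. We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization. We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention.
Reference score (attention metric computed independently by this site): 14.98480544580102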
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for language data, they quickly found their way to image, video, graph, and other data modalities with various signal geometries. Despite this versatility, generalizing the attention mechanism to scenarios where data is presented at different scales from potentially different modalities is not straightforward. The attempts to incorporate hierarchy and multi-modality within transformers are largely based on ad hoc heuristics, which are not seamlessly generalizable to similar problems with potentially different structures. To address this problem, in this paper, we take a fundamentally different approach: we first propose a mathematical construct to represent multi-modal, multi-scale data. We then mathematically derive the neural attention mechanics for the proposed construct from the first principle of entropy minimization. We show that the derived formulation is optimal in the sense of being the closest to the standard Softmax attention while incorporating the inductive biases originating from the hierarchical/geometric information of the problem. We further propose an efficient algorithm based on dynamic programming to compute our derived attention mechanism. By incorporating it within transformers, we show that the proposed hierarchical attention mechanism not only can be employed to train transformer models in hierarchical/multi-modal settings from scratch, but it can also be used to inject hierarchical information into classical, pre-trained transformer models post-training, resulting in more efficient models in a zero-shot manner.
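For orientation, the abstract measures its hierarchical formulation against standard Softmax attention. The following is a minimal sketch of that baseline only (scaled dot-product self-attention); it is not the paper's hierarchical mechanism or its dynamic-programming algorithm. The function names `softmax` and `self_attention` and the toy input `x` are hypothetical, chosen purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Standard scaled dot-product attention on lists of vectors.

    Illustrative baseline only: flat (single-scale) attention, with no
    hierarchy or modality structure.
    """
    d = len(keys[0])  # key dimension used for the 1/sqrt(d) scaling
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)  # one attention distribution per query
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy example (hypothetical data): three 2-dimensional tokens attending to each other.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = self_attention(x, x, x)
```

Each row of `w` is a probability distribution over the tokens. According to the abstract, the hierarchical variant stays as close as possible to this distribution while adding inductive biases from the hierarchy.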

Related papers

Interpreting Transformer Architectures as Implicit Multinomial Regression [3.2371089062298317]
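In a fixed multinomial regression setting, we show that optimizing over latent features yields optimal solutions that align with the dynamics induced by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.
Paper reference translation (metadata) (2025-09-04T20:40:37Z)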
Dynamical Mean-Field Theory of Self-Attention Neural Networks [0.0]
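Transformer-based models have demonstrated exceptional performance across diverse domains. Little is known about how they operate or about their expected dynamics. We use methods developed for the study of asymmetric Hopfield networks in nonequilibrium regimes.
Paper reference translation (metadata) (2024-06-11T13:29:34Z)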
Multi-Hierarchical Surrogate Learning for Structural Dynamical Crash Simulations Using Graph Convolutional Neural Networks [5.582881461692378]
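We propose a multi-hierarchical framework for structurally creating a series of surrogate models for a kart frame. For multi-scale phenomena, macro-scale features are captured on a coarse surrogate, while micro-scale effects are resolved by finer ones. We train surrogates based on graph convolutional neural networks that learn parameter-dependent low-dimensional latent dynamics on the coarse representation.
Paper reference translation (metadata) (2024-02-14T15:22:59Z)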
On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
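We build a global convergence theory of encoder-only shallow Transformers under realistic conditions. Our results can pave the way for a better understanding of modern Transformers, particularly their training dynamics.
Paper reference translation (metadata) (2023-11-02T20:03:05Z)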
From system models to class models: An in-context learning paradigm [0.0]
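This paper proposes a new paradigm for system identification that addresses two main tasks: one-step-ahead prediction and multi-step simulation. It learns a meta model that represents a class of dynamical systems. For one-step prediction, a GPT-like decoder-only architecture is used, whereas the simulation problem uses an encoder-decoder structure.
Paper reference translation (metadata) (2023-08-25T13:50:17Z)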
Mega: Moving Average Equipped Gated Attention [150.3124713793503]
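Mega is a simple, theoretically grounded, single-head gated attention mechanism equipped with an (exponential) moving average. We show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.
Paper reference translation (metadata) (2022-09-21T20:52:17Z)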
Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation [0.0]
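Multiformer is a Transformer-based model that allows a different attention mechanism to be used on each head. By doing this, the model can bias self-attention towards extracting more diverse token interactions. The results show that mixing attention patterns across the different heads and layers outperforms our baseline by up to 0.7 BLEU.
Paper reference translation (metadata) (2022-05-14T17:37:47Z)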
Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning [114.36124979578896]
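We design a dynamic mechanism using offline reinforcement learning algorithms. Our algorithm is based on the pessimism principle and requires only a mild assumption on the coverage of the offline data set.
Paper reference translation (metadata) (2022-05-05T05:44:26Z)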
Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
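We propose a new Transformer layer that divides the hidden representation and parameters into multiple mechanisms, which exchange information through attention. We study TIM on a large-scale BERT model, on an image transformer, and on speech enhancement, and find evidence of semantically meaningful specialization as well as improved performance.
Paper reference translation (metadata) (2021-02-27T21:48:46Z)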
Attention that does not Explain Away [54.42960937271612]
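Models based on the Transformer architecture achieve better accuracy than models based on competing architectures on a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows information to flow freely across arbitrary distances. We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
Paper reference translation (metadata) (2020-09-29T21:05:39Z)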
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
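We propose to replace all but one attention head of each encoder layer with simple fixed (non-learnable) attention patterns. Experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not affect translation quality.
Paper reference translation (metadata) (2020-02-24T13:53:06Z)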

The related papers list is generated automatically from the titles and abstracts of papers on this site.
