Fugu-MT Paper Translation (Abstract): Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating

Paper summary: Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating

arxiv url: http://arxiv.org/abs/2604.06014v2
Date: Fri, 10 Apr 2026 15:47:46 GMT
Status: Translation complete
System update date: 2026-04-13 13:51:27.655615
Title: Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
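Title (reference translation): Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
Authors: Dipan Maity, Suman Mondal, Arindam Roy
Abstract summary: Gated-SwinRMT is a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT). Gated-SwinRMT-SWAT substitutes sigmoid activation for softmax, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU. Gated-SwinRMT-Retention retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate.
Reference score (independently computed attention measure): 0.6945765172815976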
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We introduce Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of Retentive Networks (RMT), augmented by input-dependent gating. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Two variants are proposed. Gated-SwinRMT-SWAT substitutes softmax with sigmoid activation, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. Gated-SwinRMT-Retention retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate, projected from the block input and applied after local context enhancement (LCE) but prior to the output projection $W_O$, to alleviate the low-rank $W_V \cdot W_O$ bottleneck and enable input-dependent suppression of attended outputs. We assess both variants on Mini-ImageNet ($224 \times 224$, 100 classes) and CIFAR-10 ($32 \times 32$, 10 classes) under identical training protocols, utilizing a single GPU due to resource limitations. At $\approx 77$ to $79$ M parameters, Gated-SwinRMT-SWAT achieves $80.22\%$ and Gated-SwinRMT-Retention $78.20\%$ top-1 test accuracy on Mini-ImageNet, compared with $73.74\%$ for the RMT baseline. On CIFAR-10, where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope, the accuracy advantage compresses from $+6.48$ pp to $+0.56$ pp.
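Abstract (reference translation): This paper introduces Gated-SwinRMT, a family of hybrid vision transformers that combine the shifted-window attention of the Swin Transformer with the Manhattan-distance spatial decay of RMT. Self-attention is decomposed into consecutive width-wise and height-wise retention passes within each shifted window, where per-head exponential decay masks provide a two-dimensional locality prior without learned positional biases. Gated-SwinRMT-SWAT substitutes sigmoid activation for softmax, implements balanced ALiBi slopes with multiplicative post-activation spatial decay, and gates the value projection via SwiGLU; the normalized output implicitly suppresses uninformative attention scores. Gated-SwinRMT-Retention retains softmax-normalized retention with an additive log-space decay bias and incorporates an explicit G1 sigmoid gate, projected from the block input and applied after local context enhancement (LCE) but before the output projection $W_O$, which alleviates the low-rank $W_V \cdot W_O$ bottleneck and enables input-dependent suppression of attended outputs. Both variants are evaluated on Mini-ImageNet ($224 \times 224$, 100 classes) and CIFAR-10 ($32 \times 32$, 10 classes) under identical training protocols, using a single GPU due to resource limitations. At $\approx 77$ to $79$ M parameters, Gated-SwinRMT-SWAT achieves $80.22\%$ and Gated-SwinRMT-Retention $78.20\%$ top-1 test accuracy on Mini-ImageNet, compared with $73.74\%$ for the RMT baseline. On CIFAR-10, where small feature maps cause the adaptive windowing mechanism to collapse attention to global scope, the accuracy advantage compresses from $+6.48$ pp to $+0.56$ pp.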
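
The decay mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single decay rate `gamma`, the square gate and output matrices, and the omission of LCE, ALiBi slopes, SwiGLU, and window shifting are all assumptions made for illustration. It builds a Manhattan-distance decay mask over one window, shows that the mask factorizes into a height-wise and a width-wise 1D mask (the property that makes consecutive axis-wise passes natural), and contrasts a multiplicative post-sigmoid decay with an additive log-space bias before softmax.

```python
import numpy as np


def decay_1d(n, gamma):
    """1D decay mask: entry (i, j) is gamma ** |i - j|."""
    idx = np.arange(n)
    return gamma ** np.abs(idx[:, None] - idx[None, :])


def manhattan_decay_mask(height, width, gamma):
    """2D decay mask over a flattened (height x width) window.

    Entry (p, q) is gamma ** (|dy| + |dx|), the Manhattan distance
    between token p and token q in row-major order.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)            # (N, 2)
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)  # (N, N)
    return gamma ** dist


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def swat_style_weights(scores, mask):
    """Assumed SWAT-style form: sigmoid first, then multiplicative decay."""
    return sigmoid(scores) * mask


def retention_style_weights(scores, mask):
    """Assumed Retention-style form: additive log-decay bias, then softmax."""
    return softmax(scores + np.log(mask), axis=-1)


def gated_output(x, attended, w_gate, w_out):
    """Assumed G1-style gate: sigmoid(x @ w_gate) scales the attended
    features elementwise before the output projection w_out.
    LCE is omitted here."""
    gate = sigmoid(x @ w_gate)
    return (gate * attended) @ w_out


# Usage: a 2 x 3 window (6 tokens) with hypothetical 4-dim features.
rng = np.random.default_rng(0)
mask = manhattan_decay_mask(2, 3, gamma=0.5)
scores = rng.normal(size=(6, 6))
x = rng.normal(size=(6, 4))
values = rng.normal(size=(6, 4))
attended = retention_style_weights(scores, mask) @ values
out = gated_output(x, attended, rng.normal(size=(4, 4)), rng.normal(size=(4, 4)))
```

Because $\gamma^{|dy| + |dx|} = \gamma^{|dy|} \cdot \gamma^{|dx|}$, the full mask equals the Kronecker product `np.kron(decay_1d(height, gamma), decay_1d(width, gamma))`. Whether the paper's two retention passes are exactly equivalent to applying this full mask is not stated in the abstract.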

Related papers