Fugu-MT Paper Translation (Abstract): Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

Paper overview: Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

arxiv url: http://arxiv.org/abs/2606.14757v1
Date: Mon, 08 Jun 2026 09:24:52 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:32.047054
Title: Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers
Title (reference translation): Spatial Priors via Space-Filling Curves for Small and Limited-Data Vision Transformers
Authors: Leyla Naz Candogan, Arshia Afzal, Pol Puigdemont, Volkan Cevher
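Abstract summary: VIOLIN is an attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs). Across a wide range of evaluations, it consistently improves performance. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase performance.
Reference score (independently computed attention metric): 43.297561003640176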
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation equivariance, their attention mechanism lacks explicit spatial inductive biases. This becomes particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited-data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the tasks where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase performance. Beyond fine-tuning, VIOLIN improves various small-scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited-data settings.
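Abstract (reference translation): Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, but because of permutation equivariance their attention mechanism lacks an explicit spatial inductive bias. This matters most in two settings: when model capacity is small and when training data is limited. Inspired by the attention masking strategies of Linear Transformers and the scanning patterns of Vision SSMs, VIOLIN is a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs), with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image with multiple SFCs, builds a curve-specific decay mask for each, then combines the masks and multiplies the result with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited-data regimes such as fine-tuning on VTAB-1K, it raises accuracy in every task group, by up to 8.7% on tasks where spatial information is essential. Combining it with parameter-efficient fine-tuning methods such as LoRA improves performance further. Beyond fine-tuning, VIOLIN improves various small-scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. On pixel-level CIFAR-100 training, a task that depends heavily on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN offers a computationally efficient and effective way to inject spatial inductive bias into ViTs, especially for small models and limited-data settings.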
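
Illustrative sketch (not from the paper): the abstract says only that VIOLIN scans the image along multiple SFCs, builds a decay mask per curve, combines the masks, and multiplies the result with the attention matrix. The NumPy code below is one possible reading of that sentence and should not be taken as the authors' implementation. Everything specific in it is an assumption: the two curves (a snake scan and a column-major scan, standing in for whatever SFCs the paper uses), the mask form gamma ** |distance along the curve|, averaging as the combination rule, applying the mask after the softmax, and the row renormalization. All function names and the `gammas` parameter are hypothetical.

```python
import numpy as np

def snake_order(h, w):
    """Position of each patch (row-major id) along a snake scan (assumed curve)."""
    grid = np.arange(h * w).reshape(h, w)
    pos = grid.copy()
    pos[1::2] = grid[1::2, ::-1]  # odd rows are traversed right to left
    return pos.reshape(-1)

def column_order(h, w):
    """Position of each patch (row-major id) along a column-major scan (assumed curve)."""
    r, c = np.divmod(np.arange(h * w), w)
    return c * h + r

def decay_mask(pos, gamma):
    """Assumed mask form: gamma ** (distance between two patches along the curve)."""
    dist = np.abs(pos[:, None] - pos[None, :])
    return gamma ** dist

def combined_mask(h, w, gammas):
    """Average the per-curve masks (the combination rule is an assumption)."""
    curves = [snake_order(h, w), column_order(h, w)]
    masks = [decay_mask(p, g) for p, g in zip(curves, gammas)]
    return np.mean(masks, axis=0)

def masked_attention(q, k, v, mask):
    """Softmax attention multiplied elementwise by the mask, then renormalized per row."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    attn = attn * mask  # inject the spatial prior
    attn = attn / attn.sum(axis=-1, keepdims=True)  # renormalization is an assumption
    return attn @ v

# Usage: a 2x3 grid of patches with 4-dimensional tokens.
rng = np.random.default_rng(0)
h, w, d = 2, 3, 4
q, k, v = (rng.standard_normal((h * w, d)) for _ in range(3))
mask = combined_mask(h, w, gammas=(0.9, 0.9))
out = masked_attention(q, k, v, mask)
```

In this sketch the only extra quantities are the per-curve decay rates, which is at least in the spirit of the very small parameter overhead the abstract reports. Whether the paper learns such rates per head, per layer, or in some other way is not stated in the abstract.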

Related papers