Fugu-MT Paper Translation (Abstract): ParaFormer: Shallow Parallel Transformers with Progressive Approximation

Paper overview: ParaFormer: Shallow Parallel Transformers with Progressive Approximation

arxiv url: http://arxiv.org/abs/2510.15425v1
Date: Fri, 17 Oct 2025 08:28:26 GMT
Status: Translation complete
Last updated in system: 2025-10-20 20:17:34.536692
Title: ParaFormer: Shallow Parallel Transformers with Progressive Approximation
Authors: Wei Wang, Xiao-Yong Wei, Qing Li
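Abstract summary: ParaFormer is a shallow Transformer architecture designed for true parallelism in both structure and computation. Theoretical analysis shows that the performance of standard Transformers relies on progressive approximation through inter-layer collaboration. ParaFormer supports up to 15.07x model compression and facilitates model expansion for adaptive continuous learning.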
Reference score (independently computed attention score): 14.82319078008725
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The widespread 'deeper is better' philosophy has driven the creation of architectures like ResNet and Transformer, which achieve high performance by stacking numerous layers. However, increasing model depth comes with challenges such as longer training times, higher inference latency, and impracticality on resource-constrained devices. To address these issues, we propose ParaFormer, a shallow Transformer architecture designed for true parallelism in both structure and computation. By formulating standard Transformers as function approximators in closed form, our theoretical analysis shows that their performance relies on inter-layer collaboration for progressive approximation, rather than on depth itself. While deep Transformers enforce this collaboration through sequential designs, we demonstrate that such collaboration is not inherently tied to sequential structures. ParaFormer removes the sequential constraint by organizing layers into parallel branches and enforcing inter-layer collaboration algorithmically. Specifically, we implement progressive approximation, ensuring that each new branch further reduces the loss from preceding branches, which enables faster convergence. Extensive experiments validate ParaFormer's effectiveness: it outperforms standard Transformers such as ViT. Moreover, ParaFormer supports up to 15.07x model compression and facilitates model expansion for adaptive continuous learning. Experimental results on multi-GPU deployment demonstrate that ParaFormer is 3.30x faster than widely used parallelism solutions such as FairScale. These advancements stem from our closed-form formulation of Transformers based on the Universal Approximation Theorem, which not only explains the 'depth belief' but also opens new avenues for designing efficient Transformer architectures. Source code: https://(open-upon-acceptance)
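The abstract describes progressive approximation only at a high level: branches sit side by side, and each new branch is meant to further reduce the loss left by the preceding ones. The toy sketch below illustrates that idea under strong assumptions and is not the authors' method. Each hypothetical "branch" is a fixed random tanh feature of a scalar input with a single least-squares coefficient fitted to the current residual (a matching-pursuit-style stand-in for a Transformer branch). The names `make_branch`, `fit_progressively`, and `predict` are invented for illustration.

```python
import math
import random


def mse(values):
    """Mean of squared values; used as the loss on a residual vector."""
    return sum(v * v for v in values) / len(values)


def make_branch(rng):
    """Hypothetical branch: a fixed random tanh feature of a scalar input."""
    w = rng.uniform(-3.0, 3.0)
    b = rng.uniform(-3.0, 3.0)
    return lambda x: math.tanh(w * x + b)


def fit_progressively(xs, ys, num_branches, seed=0):
    """Add branches one at a time, each fitted to the residual of the previous ones.

    Returns the fitted branches and the loss after 0, 1, ..., num_branches branches.
    """
    rng = random.Random(seed)
    residual = list(ys)
    branches = []
    losses = [mse(residual)]
    for _ in range(num_branches):
        phi = make_branch(rng)
        feats = [phi(x) for x in xs]
        denom = sum(f * f for f in feats)
        # Least-squares coefficient of the residual on this branch's feature.
        coef = sum(r * f for r, f in zip(residual, feats)) / denom if denom > 0 else 0.0
        residual = [r - coef * f for r, f in zip(residual, feats)]
        branches.append((coef, phi))
        losses.append(mse(residual))
    return branches, losses


def predict(branches, x):
    """Sum of branch outputs; no branch consumes another branch's output."""
    return sum(coef * phi(x) for coef, phi in branches)


if __name__ == "__main__":
    xs = [i / 20.0 - 1.0 for i in range(41)]
    ys = [math.sin(2.0 * x) for x in xs]
    _, losses = fit_progressively(xs, ys, num_branches=8)
    for k, loss in enumerate(losses):
        print(f"branches={k} loss={loss:.6f}")
```

Because each coefficient is the least-squares projection of the current residual onto the new feature, the squared error cannot increase when a branch is added. At prediction time the branch outputs are simply summed, so they could in principle be evaluated on separate devices. How ParaFormer actually trains and combines its branches is not specified in the abstract.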

Related papers