Fugu-MT Paper Translation (Abstract): FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores

arxiv url: http://arxiv.org/abs/2311.05908v1
Date: Fri, 10 Nov 2023 07:33:35 GMT
Status: Translation complete
Last updated in system: 2023-11-13 15:44:16.539791
Title: FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
Authors: Daniel Y. Fu, Hermann Kumbong, Eric Nguyen, Christopher Ré
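Abstract summary: Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks. The Fast Fourier Transform (FFT) allows long convolutions to run in $O(N \log N)$ time in sequence length $N$ but has poor hardware utilization. This paper studies how to optimize the FFT convolution.
Reference score (independently computed attention measure): 18.016204763652553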
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Convolution models with long filters have demonstrated state-of-the-art reasoning abilities in many long-sequence tasks but lag behind the most optimized Transformers in wall-clock time. A major bottleneck is the Fast Fourier Transform (FFT)--which allows long convolutions to run in $O(N \log N)$ time in sequence length $N$ but has poor hardware utilization. In this paper, we study how to optimize the FFT convolution. We find two key bottlenecks: the FFT does not effectively use specialized matrix multiply units, and it incurs expensive I/O between layers of the memory hierarchy. In response, we propose FlashFFTConv. FlashFFTConv uses a matrix decomposition that computes the FFT using matrix multiply units and enables kernel fusion for long sequences, reducing I/O. We also present two sparse convolution algorithms--1) partial convolutions and 2) frequency-sparse convolutions--which can be implemented simply by skipping blocks in the matrix decomposition, enabling further opportunities for memory and compute savings. FlashFFTConv speeds up exact FFT convolutions by up to 7.93$\times$ over PyTorch and achieves up to 4.4$\times$ speedup end-to-end. Given the same compute budget, FlashFFTConv allows Hyena-GPT-s to achieve 2.3 points better perplexity on the PILE and M2-BERT-base to achieve 3.3 points higher GLUE score--matching models with twice the parameter count. FlashFFTConv also achieves 96.1% accuracy on Path-512, a high-resolution vision task where no model had previously achieved better than 50%. Furthermore, partial convolutions enable longer-sequence models--yielding the first DNA model that can process the longest human genes (2.3M base pairs)--and frequency-sparse convolutions speed up pretrained models while maintaining or improving model quality.
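To make the two ideas in the abstract concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes a single 1-D real signal and a two-factor split N = n1 * n2. The function names (dft_matrix, dft_via_matmuls, fft_conv, direct_causal_conv) are hypothetical and chosen only for illustration. The first part computes a DFT with two dense matrix multiplies and an elementwise twiddle correction (the standard Cooley-Tukey factorization), which is the general kind of decomposition that lets matrix multiply units do the work. The second part shows a long convolution computed through the FFT in $O(N \log N)$ time.

```python
import numpy as np

def dft_matrix(n):
    """Dense n x n DFT matrix F[j, k] = exp(-2*pi*i*j*k / n)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)

def dft_via_matmuls(x, n1, n2):
    """DFT of a length n1*n2 vector using two matrix multiplies.

    Illustrative two-factor Cooley-Tukey step (an assumption of this
    sketch): reshape, multiply by a small DFT matrix, apply twiddle
    factors, multiply by a second small DFT matrix, then transpose.
    """
    n = n1 * n2
    assert x.shape == (n,)
    a = x.reshape(n1, n2)                      # a[j1, j2] = x[n2*j1 + j2]
    twiddle = np.exp(-2j * np.pi *
                     np.outer(np.arange(n1), np.arange(n2)) / n)
    b = dft_matrix(n1) @ a                     # DFT along the first axis
    d = (b * twiddle) @ dft_matrix(n2)         # twiddle, then second axis
    return d.T.reshape(n)                      # output index k = k1 + n1*k2

def fft_conv(u, k):
    """Causal convolution of u with filter k via the FFT."""
    n = u.shape[-1]
    size = 2 * n                               # zero-pad to avoid wrap-around
    u_f = np.fft.rfft(u, n=size)
    k_f = np.fft.rfft(k, n=size)
    return np.fft.irfft(u_f * k_f, n=size)[:n]

def direct_causal_conv(u, k):
    """O(N^2) reference: first len(u) outputs of the full convolution."""
    return np.convolve(u, k)[: len(u)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(32)
    print(np.allclose(dft_via_matmuls(x, 4, 8), np.fft.fft(x)))
    u, k = rng.standard_normal(64), rng.standard_normal(64)
    print(np.allclose(fft_conv(u, k), direct_causal_conv(u, k)))
```

The dense matrices here are wasteful on a CPU. The sketch only illustrates that the FFT can be phrased as matrix multiplies plus an elementwise correction, and that skipping blocks of such matrices is a natural place to introduce sparsity.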

Related papers

RecConv: Efficient Recursive Convolutions for Multi-Frequency Representations [8.346566205092433]
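RecConv is a decomposition strategy that efficiently builds multi-frequency representations using small-kernel convolutions. RecNeXt-M3 outperforms RepViT-M1.1 by 1.9 $AP^{box}$ on COCO.
Paper reference translation (metadata) (2024-12-27T13:13:52Z)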
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [11.508362885430133]
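FlashAttention exploits the asymmetric GPU memory hierarchy to achieve significant memory savings and runtime speedup. However, FlashAttention is still not as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. To address these issues, we propose FlashAttention-2, with better work partitioning.
Paper reference translation (metadata) (2023-07-17T17:50:36Z)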
Im2win: An Efficient Convolution Paradigm on GPU [1.9162301033784574]
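This paper proposes a paradigm for convolution-based convolution called im2win, which not only reduces the memory footprint but also provides continuous memory access. We compare it with direct convolution, PyTorch's GEMM-based convolution, and six DNN-based convolution implementations on twelve state-of-the-art benchmarks.
Paper reference translation (metadata) (2023-06-25T19:09:56Z)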
Simple Hardware-Efficient Long Convolutions for Sequence Modeling [18.3719016967593]
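State space models (SSMs) achieve high performance on long sequence modeling. We study whether a simple alternative can match SSMs in performance and efficiency. We develop FlashButterfly, an IO-aware algorithm that improves the runtime performance of long convolutions.
Paper reference translation (metadata) (2023-02-13T19:19:23Z)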
FInC Flow: Fast and Invertible $k \times k$ Convolutions for Normalizing Flows [2.156373334386171]
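Invertible convolutions are an essential component for building expressive normalizing-flow-based generative models. We propose a $k \times k$ convolutional layer and a Deep Normalizing Flow architecture.
Paper reference translation (metadata) (2023-01-23T04:31:03Z)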
Softmax-free Linear Transformers [90.83157268265654]
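Vision transformers (ViTs) have pushed the state of the art for visual perception tasks. Existing methods are either theoretically flawed or empirically ineffective for visual recognition. We propose a family of Softmax-Free Transformers (SOFT).
Paper reference translation (metadata) (2022-07-05T03:08:27Z)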
Early Convolutions Help Transformers See Better [63.21712652156238]
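Vision transformer (ViT) models exhibit substandard optimizability. Modern convolutional neural networks are far easier to optimize. Using a convolutional stem in ViT dramatically increases optimization stability and also improves peak performance.
Paper reference translation (metadata) (2021-06-28T17:59:33Z)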
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
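We propose a novel method to accelerate attention computation for Transformers with relative positional encoding (RPE). Based on the observation that relative positional encoding forms a Toeplitz matrix, we show mathematically that kernelized attention with RPE can be computed efficiently using the Fast Fourier Transform (FFT).
Paper reference translation (metadata) (2021-06-23T17:51:26Z)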
Decoupled Dynamic Filter Networks [85.38058820176047]
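We propose the Decoupled Dynamic Filter (DDF), which can address these shortcomings simultaneously. Inspired by recent advances in attention, DDF decouples a depth-wise dynamic filter into spatial and channel dynamic filters. We observe performance gains when replacing standard convolutions with DDF in classification networks.
Paper reference translation (metadata) (2021-04-29T04:55:33Z)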
XSepConv: Extremely Separated Convolution [60.90871656244126]
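We propose the extremely separated convolutional block (XSepConv). It fuses spatially separable convolutions into depthwise convolution, reducing both the computational cost and the parameter size of large kernels. XSepConv is designed as an efficient alternative to vanilla depthwise convolution with large kernel sizes.
Paper reference translation (metadata) (2020-02-27T11:46:17Z)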
DFTpy: An efficient and object-oriented platform for orbital-free DFT simulations [55.41644538483948]
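This paper introduces DFTpy, open-source software implementing OFDFT that is written entirely in Python 3. We present the electronic structure of a million-atom aluminum system computed on a single CPU. DFTpy is released under the MIT license.
Paper reference translation (metadata) (2020-02-07T19:07:41Z)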

The related-papers list is generated automatically from the titles and abstracts of papers on this site.
