Fugu-MT Paper Translation (Summary): A Tensor Compiler for Processing-In-Memory Architectures

Paper summary: A Tensor Compiler for Processing-In-Memory Architectures

arxiv url: http://arxiv.org/abs/2511.15503v1
Date: Wed, 19 Nov 2025 14:58:16 GMT
Status: Translation complete
Last updated in system: 2025-11-20 15:51:28.856587
Title: A Tensor Compiler for Processing-In-Memory Architectures
Title (reference translation): A Tensor Compiler for In-Memory Processing
Authors: Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula,
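Abstract summary: Processing-In-Memory (PIM) devices can accelerate memory-intensive kernels in machine learning (ML) models, including Large Language Models (LLMs). Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends. We design DCC, the first data-centric ML compiler for PIM systems that jointly optimizes data rearrangements and compute code.
Reference score (independently computed attention score): 8.353569627672622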
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging high memory bandwidth at PIM cores. However, Host processors and PIM cores require different data layouts: Hosts need consecutive elements distributed across DRAM banks, while PIM cores need them within local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM backends. Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends and may largely ignore data rearrangements during compute code optimization. We demonstrate that data rearrangements and compute code optimization are interdependent and need to be jointly optimized during the tuning process. To address this, we design DCC, the first data-centric ML compiler for PIM systems that co-optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction that enables various data distribution and processing strategies on different PIM backends. DCC enables effective co-optimization by mapping data partitioning strategies to compute loop partitions, applying PIM-specific code optimizations, and leveraging a fast and accurate performance prediction model to select optimal configurations. Our evaluations on various individual ML kernels demonstrate that DCC achieves up to 7.68x speedup (2.7x average) on HBM-PIM and up to 13.17x speedup (5.75x average) on the AttAcc PIM backend over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by up to 7.71x (4.88x average) over GPU.
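Abstract (reference translation): Processing-In-Memory (PIM) devices integrated with high-performance Host processors (e.g., GPUs) can accelerate memory-intensive kernels in machine learning (ML) models, including Large Language Models (LLMs), by leveraging the high memory bandwidth at PIM cores. However, Host processors and PIM cores require different data layouts: Hosts need consecutive elements distributed across DRAM banks, while PIM cores need them within local banks. This necessitates data rearrangements in ML kernel execution, which pose significant performance and programmability challenges that are further exacerbated by the need to support diverse PIM backends. Current compilation approaches lack systematic optimization for diverse ML kernels across multiple PIM backends and may largely ignore data rearrangements during compute code optimization. We show that data rearrangements and compute code optimization are interdependent and need to be jointly optimized during the tuning process. We therefore design DCC, a data-centric ML compiler for PIM systems that jointly optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction that enables various data distribution and processing strategies on different PIM backends. DCC enables effective co-optimization by mapping data partitioning strategies to compute loop partitions, applying PIM-specific code optimizations, and leveraging a fast and accurate performance prediction model to select optimal configurations. DCC achieves up to 7.68x speedup (2.7x average) on HBM-PIM and up to 13.17x speedup (5.75x average) on the AttAcc PIM backend over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by up to 7.71x (4.88x average) over GPU.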
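
The layout mismatch described in the abstract can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: the bank count `NUM_BANKS` and the functions `host_layout`, `pim_layout`, and `rearrange_host_to_pim` are hypothetical names introduced here, and they do not describe how DCC, HBM-PIM, or AttAcc actually map data. It shows a Host-style layout that interleaves consecutive elements across banks, a PIM-style layout that keeps a contiguous chunk inside each bank, and the rearrangement needed to go from one to the other.

```python
# Illustrative sketch only: NUM_BANKS and all functions below are
# hypothetical and do not describe DCC, HBM-PIM, or AttAcc.
NUM_BANKS = 4  # assumed number of DRAM banks


def host_layout(data, num_banks=NUM_BANKS):
    """Interleave consecutive elements across banks (Host-friendly)."""
    banks = [[] for _ in range(num_banks)]
    for i, value in enumerate(data):
        banks[i % num_banks].append(value)
    return banks


def pim_layout(data, num_banks=NUM_BANKS):
    """Keep each contiguous chunk inside a single bank (PIM-friendly)."""
    chunk = -(-len(data) // num_banks)  # ceiling division
    return [data[b * chunk:(b + 1) * chunk] for b in range(num_banks)]


def rearrange_host_to_pim(banks):
    """Undo the interleaving, then re-partition into contiguous chunks."""
    num_banks = len(banks)
    total = sum(len(b) for b in banks)
    # Element i of the original order sits in bank i % num_banks,
    # at position i // num_banks within that bank.
    flat = [banks[i % num_banks][i // num_banks] for i in range(total)]
    return pim_layout(flat, num_banks)


if __name__ == "__main__":
    data = list(range(8))
    print("Host layout:", host_layout(data))
    print("PIM layout: ", pim_layout(data))
    print("Rearranged: ", rearrange_host_to_pim(host_layout(data)))
```

In this toy model every element changes banks during the rearrangement except those that already happen to sit in the right one, which gives a rough intuition for why the abstract treats rearrangement cost as something to tune together with the compute code.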

Related papers