Fugu-MT Paper Translation (Abstract): Network and Compiler Optimizations for Efficient Linear Algebra Kernels in Private Transformer Inference

Paper overview: Network and Compiler Optimizations for Efficient Linear Algebra Kernels in Private Transformer Inference

arxiv url: http://arxiv.org/abs/2512.11135v1
Date: Thu, 11 Dec 2025 21:56:47 GMT
Status: Translation complete
Last updated in system: 2025-12-15 15:48:11.580447
Title: Network and Compiler Optimizations for Efficient Linear Algebra Kernels in Private Transformer Inference
Title (reference translation): Network and compiler optimizations for efficient linear algebra kernels in private transformer inference
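Authors: Karthik Garimella, Negar Neda, Austin Ebel, Nandan Kumar Jha, Brandon Reagen
Abstract summary: Fully Homomorphic Encryption (FHE) enables computation directly on encrypted queries. Running encrypted transformer inference is difficult because programmers must map standard kernels to the constrained instruction set that FHE provides.
Reference score (independently computed attention score): 2.725051134664174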
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM) based services are primarily structured as client-server interactions, with clients sending queries directly to cloud providers that host LLMs. This approach currently compromises data privacy as all queries must be processed in the cloud and in the clear. Fully Homomorphic Encryption (FHE) is a solution to this data privacy issue by enabling computations directly upon encrypted queries. However, running encrypted transformer inference is challenging as programmers must map standard kernels to the constrained instruction set provided by FHE. In this work, we explore implementations of linear algebra kernels needed for transformer inference in FHE and understand how network optimization can help mitigate FHE costs while remaining performant. We leverage the Orion PyTorch to FHE framework to benchmark several linear algebra kernels in order to profile two linear transformation methods, packed row and BSGS, and find that BSGS outperforms packed row methods by up to $13.7 \times$ at transformer-level scales. We also incorporate network-level pruning strategies that reduce FHE runtimes of feed forward layers by up to $11.46\times$. Furthermore, we extend Orion to include ciphertext-ciphertext matrix-matrix products, a key component in the self-attention blocks. Finally, we perform a roofline analysis of FHE primitives and encrypted linear transformations and find that (SIMD encoded) implementations are memory-bound with primitives having roughly $0.1$ integer operations per byte of DRAM traffic. These findings illustrate the need for exploring alternative encoding schemes and models of computation within CKKS to unlock scalable private transformer inference. We conduct all experiments using the Orion framework which can be found at: https://github.com/baahl-nyu/orion.
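Abstract (reference translation): Large language model (LLM) based services are primarily structured as client-server interactions, with clients sending queries directly to cloud providers that host LLMs. This approach currently compromises data privacy, because every query must be processed in the cloud and in the clear. Fully Homomorphic Encryption (FHE) addresses this data privacy problem by enabling computation directly on encrypted queries. However, running encrypted transformer inference is difficult because standard kernels must be mapped to the constrained instruction set that FHE provides. In this work, we examine implementations of the linear algebra kernels needed for transformer inference under FHE and consider how network optimization can mitigate FHE costs while preserving performance. We use the Orion PyTorch-to-FHE framework to benchmark several linear algebra kernels and profile two linear transformation methods, packed row and BSGS, and find that BSGS outperforms the packed row method by up to $13.7 \times$ at transformer-level scales. We also incorporate network-level pruning strategies that reduce the FHE runtime of feed-forward layers by up to $11.46\times$. Furthermore, we extend Orion to include ciphertext-ciphertext matrix-matrix products, a key component of self-attention blocks. Finally, we perform a roofline analysis of FHE primitives and encrypted linear transformations and show that (SIMD-encoded) implementations are memory-bound, with primitives performing roughly $0.1$ integer operations per byte of DRAM traffic. These results indicate the need to explore alternative encoding schemes and models of computation within CKKS to unlock scalable private transformer inference. All experiments use the Orion framework, available at https://github.com/baahl-nyu/orion.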
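
Illustrative sketch (not from the paper): the abstract compares packed row and BSGS linear transformations without spelling out either method. Assuming "BSGS" refers to the standard baby-step giant-step variant of the diagonal method for matrix-vector products over SIMD slots, the plaintext NumPy simulation below shows why it needs fewer rotations. It does not use Orion's API. The names rot, diagonal, matvec_diagonal, and matvec_bsgs are hypothetical, and np.roll stands in for a homomorphic slot rotation.

```python
import numpy as np


def rot(x, k):
    """Cyclic left rotation by k slots (stand-in for a homomorphic rotation)."""
    return np.roll(x, -k)


def diagonal(M, k):
    """k-th generalized diagonal of a square matrix: d[i] = M[i, (i + k) mod n]."""
    n = M.shape[0]
    i = np.arange(n)
    return M[i, (i + k) % n]


def matvec_diagonal(M, v):
    """Plain diagonal method: one rotation of v per nonzero diagonal index."""
    n = M.shape[0]
    acc = np.zeros(n)
    rotations = 0
    for k in range(n):
        if k:
            rotations += 1  # rotating v by k slots
        acc += diagonal(M, k) * rot(v, k)
    return acc, rotations


def matvec_bsgs(M, v, n1):
    """Baby-step giant-step diagonal method with n = n1 * n2."""
    n = M.shape[0]
    assert n % n1 == 0, "n1 must divide the dimension in this sketch"
    n2 = n // n1
    # Baby steps: rotate v once per offset b and reuse across all giant steps.
    baby = [rot(v, b) for b in range(n1)]
    rotations = n1 - 1
    acc = np.zeros(n)
    for g in range(n2):
        inner = np.zeros(n)
        for b in range(n1):
            # Pre-rotating the plaintext diagonal is done offline at no
            # ciphertext cost, so it is not counted.
            d = rot(diagonal(M, g * n1 + b), -g * n1)
            inner += d * baby[b]
        if g:
            rotations += 1  # one giant-step rotation per outer iteration
        acc += rot(inner, g * n1)
    return acc, rotations


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((16, 16))
    v = rng.standard_normal(16)
    y_diag, r_diag = matvec_diagonal(M, v)
    y_bsgs, r_bsgs = matvec_bsgs(M, v, n1=4)
    print(np.allclose(y_diag, M @ v), np.allclose(y_bsgs, M @ v))
    print("rotations:", r_diag, "vs", r_bsgs)
```

In this sketch the simulated rotation count drops from n - 1 to (n1 - 1) + (n2 - 1), which is 15 versus 6 for n = 16 and n1 = 4. The speedups reported in the abstract come from the authors' own benchmarks and are not reproduced by this count.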
