Fugu-MT Paper Translation (Abstract): LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

Paper overview: LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

arxiv url: http://arxiv.org/abs/2605.17653v1
Date: Sun, 17 May 2026 21:10:54 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:48.300268
Title: LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
Title (reference translation): LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models
Authors: Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane
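Abstract summary: LLMForge is a hardware-aware neural architecture search framework. Infinite-Head Attention (IHA) decouples the number of query heads, the number of KV groups, and the per-head query/key and value dimensions. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines.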
Reference score (independently computed attention level): 0.5765637715313356
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss 2.798 and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.
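Abstract (reference translation): Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets, which makes architectural choice and accelerator-specific cost central to efficient inference. We describe LLMForge, a hardware-aware neural architecture search (NAS) framework. Infinite-Head Attention (IHA) decouples the number of query heads, the number of KV groups, and the per-head query/key and value dimensions, expanding the per-layer attention configuration space by roughly 400x over grouped-query attention within the search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, combines Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, the co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is retrained on FineWeb-Edu-10BT under a matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss (2.798) and competitive benchmark performance with fewer parameters; the energy-optimized variant lowers energy per token by 40%; and the latency-optimized variant lowers TTFT and TPOT by 43%.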
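
The decoupling that the abstract attributes to IHA can be illustrated with a small sketch. This is not the authors' implementation: the `AttnConfig` class, the helper functions, the projection layout (Q, K, V, and output matrices without biases), the assumption that grouped-query attention ties the head dimension to `d_model // n_q_heads`, the assumption that the number of KV groups must still divide the number of query heads, and the search ranges used in the usage note are all hypothetical choices made for illustration. The resulting ratio is therefore not the paper's roughly 400x figure.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AttnConfig:
    """One per-layer attention shape (hypothetical container, not from the paper)."""
    n_q_heads: int    # number of query heads
    n_kv_groups: int  # number of shared key/value groups
    d_qk: int         # per-head query/key dimension
    d_v: int          # per-head value dimension

    def projection_params(self, d_model: int) -> int:
        """Weights in the Q, K, V, and output projections, ignoring biases."""
        q = d_model * self.n_q_heads * self.d_qk
        k = d_model * self.n_kv_groups * self.d_qk
        v = d_model * self.n_kv_groups * self.d_v
        o = self.n_q_heads * self.d_v * d_model
        return q + k + v + o

    def kv_cache_elems_per_token(self) -> int:
        """Key plus value elements stored per token for this layer."""
        return self.n_kv_groups * (self.d_qk + self.d_v)


def gqa_configs(d_model, head_counts):
    """Tied baseline (assumed): d_qk == d_v == d_model // n_q_heads."""
    configs = []
    for n_q in head_counts:
        if d_model % n_q != 0:
            continue
        d_head = d_model // n_q
        for n_kv in head_counts:
            if n_kv <= n_q and n_q % n_kv == 0:
                configs.append(AttnConfig(n_q, n_kv, d_head, d_head))
    return configs


def decoupled_configs(head_counts, dims):
    """Decoupled space: head count, KV groups, d_qk, and d_v chosen independently.

    Assumption: the number of KV groups must still divide the number of query heads.
    """
    return [
        AttnConfig(n_q, n_kv, d_qk, d_v)
        for n_q, n_kv, d_qk, d_v in product(head_counts, head_counts, dims, dims)
        if n_kv <= n_q and n_q % n_kv == 0
    ]
```

With the hypothetical ranges `head_counts=(4, 8, 16)` and `dims=(32, 64, 128)` at `d_model=1024`, `gqa_configs` yields 6 shapes and `decoupled_configs` yields 54, because each valid (query heads, KV groups) pair can now be combined with any of the 9 (`d_qk`, `d_v`) pairs. Shrinking `d_v` or `n_kv_groups` independently also changes `kv_cache_elems_per_token`, which is the kind of hardware-facing quantity a cost model could trade off against accuracy.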

Related papers