Fugu-MT Paper Translation (Abstract): QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models

Paper abstract: QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models

arxiv url: http://arxiv.org/abs/2601.00679v1
Date: Fri, 02 Jan 2026 13:05:33 GMT
Status: Translation complete
Last updated in system: 2026-01-05 15:04:33.573604
Title: QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
Title (reference translation): QSLM: A performance- and memory-aware quantization framework with a tiered search strategy for spike-driven language models
Authors: Rachmad Vidya Wicaksana Putra, Pasindu Wickramasinghe, Muhammad Shafique
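Abstract summary: Large language models (LLMs) have emerged as prominent AI models for solving many natural language tasks. Their large computational cost, huge memory footprint, and high processing power/energy make embedded deployment difficult. This work proposes a new framework that performs automated quantization to compress pre-trained SLMs.
Reference score (independently computed attention metric): 3.1061484260786014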
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have emerged as prominent AI models for solving many natural language tasks due to their high performance (e.g., accuracy) and their ability to generate high-quality responses to given inputs. However, their large computational cost, huge memory footprint, and high processing power/energy make embedded deployment challenging. Among several tinyLLMs, recent works have proposed spike-driven language models (SLMs) to significantly reduce the processing power/energy of LLMs. However, their memory footprints remain too large for low-cost, resource-constrained embedded devices. Manual quantization may effectively compress SLM memory footprints, but it requires substantial design time and compute power to find the quantization setting for each network, which makes it unscalable across different networks, performance requirements, and memory budgets. To bridge this gap, we propose QSLM, a novel framework that performs automated quantization to compress pre-trained SLMs while meeting performance and memory constraints. To achieve this, QSLM first identifies the hierarchy of the given network architecture and the sensitivity of network layers under quantization, then employs a tiered quantization strategy (e.g., global-, block-, and module-level quantization) while leveraging a multi-objective performance-and-memory trade-off function to select the final quantization setting. Experimental results indicate that QSLM reduces memory footprint by up to 86.5%, reduces power consumption by up to 20%, and maintains high performance across different tasks (i.e., up to 84.4% accuracy for sentiment classification on the SST-2 dataset and a perplexity score of 23.2 for text generation on the WikiText-2 dataset), close to that of the original non-quantized model, while meeting the performance and memory constraints.
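Abstract (reference translation): Large language models (LLMs) have emerged as prominent AI models for solving many natural language tasks, owing to their high performance (e.g., accuracy) and their ability to generate high-quality responses to given inputs. However, their large computational cost, huge memory footprint, and high processing power/energy make embedded deployment difficult. Alongside several small LLMs, recent studies have proposed spike-driven language models (SLMs), which significantly reduce the processing power/energy of LLMs. However, their memory footprints are still too large for low-cost, resource-constrained embedded devices. A manual quantization approach can effectively compress SLM memory footprints, but it requires enormous design time and compute power to find the quantization setting for each network, so the approach does not scale to different networks, performance requirements, and memory budgets. To bridge this gap, we propose QSLM, a new framework that performs automated quantization to compress pre-trained SLMs while meeting performance and memory constraints. To achieve this, QSLM first identifies the hierarchy of the given network architecture and the sensitivity of the network layers under quantization, and then applies a tiered quantization strategy (e.g., global-, block-, and module-level quantization) while using a multi-objective performance-and-memory trade-off function to select the final quantization setting. Experimental results show that QSLM reduces memory footprint by up to 86.5% and power consumption by up to 20%, and maintains performance close to that of the original non-quantized model across different tasks (up to 84.4% accuracy for sentiment classification on the SST-2 dataset and a perplexity score of 23.2 for text generation on the WikiText-2 dataset) while meeting the performance and memory constraints.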
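
The tiered strategy described in the abstract (global-, then block-, then module-level quantization, followed by a performance-and-memory trade-off selection) can be illustrated with a toy sketch. This is not the QSLM implementation: the abstract gives no algorithmic details, so the `Module` record, the linear `est_loss` proxy, the candidate bit-widths, the 32-bit baseline, and the weight `mu` are all assumptions made here for illustration. In a real setting the loss would be measured on validation data rather than estimated from a fixed per-module sensitivity.

```python
from dataclasses import dataclass

BASE_BITS = 32               # assumed precision of the non-quantized model
CANDIDATE_BITS = (16, 8, 4)  # assumed search space, tried from high to low


@dataclass(frozen=True)
class Module:
    name: str
    block: str          # hypothetical parent block in the network hierarchy
    n_params: int       # number of weights in the module
    sensitivity: float  # toy proxy: loss if every bit of precision were removed


def memory_bytes(modules, bits):
    """Weight memory under a per-module bit-width assignment."""
    return sum(m.n_params * bits[m.name] for m in modules) // 8


def est_loss(modules, bits):
    """Toy linear proxy for performance loss (an assumption, not a measurement)."""
    return sum(m.sensitivity * (BASE_BITS - bits[m.name]) / BASE_BITS
               for m in modules)


def tradeoff(modules, bits, mu):
    """Hypothetical trade-off score: loss plus weighted memory ratio (lower is better)."""
    full = memory_bytes(modules, {m.name: BASE_BITS for m in modules})
    return est_loss(modules, bits) + mu * memory_bytes(modules, bits) / full


def tiered_search(modules, max_loss, budget_bytes, mu=0.5):
    bits = {m.name: BASE_BITS for m in modules}
    blocks = sorted({m.block for m in modules})
    tiers = [
        [[m.name for m in modules]],                                      # global
        [[m.name for m in modules if m.block == blk] for blk in blocks],  # block
        [[m.name] for m in modules],                                      # module
    ]
    snapshots = []
    for groups in tiers:
        for group in groups:
            for width in CANDIDATE_BITS:
                trial = {**bits, **dict.fromkeys(group, width)}
                # Accept only moves that save memory and respect the loss limit.
                if (est_loss(modules, trial) <= max_loss
                        and memory_bytes(modules, trial) < memory_bytes(modules, bits)):
                    bits = trial
        snapshots.append(dict(bits))  # one candidate setting per tier
    feasible = [b for b in snapshots if memory_bytes(modules, b) <= budget_bytes]
    if not feasible:
        return None
    return min(feasible, key=lambda b: tradeoff(modules, b, mu))


# Hypothetical three-module network used only to exercise the sketch.
MODULES = [
    Module("embed", "embed", 1000, 0.02),
    Module("attn0", "block0", 2000, 0.08),
    Module("mlp0", "block0", 4000, 0.01),
]

if __name__ == "__main__":
    best = tiered_search(MODULES, max_loss=0.07, budget_bytes=8000)
    print(best, memory_bytes(MODULES, best) if best else None)
```

In this sketch the coarse tiers narrow the search cheaply, and the finer tiers only lower precision where the loss limit still holds, so the most sensitive module keeps the highest bit-width. The search returns `None` when no tier's setting fits the memory budget.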


Related papers