Fugu-MT Paper Translation (Abstract): Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

Paper overview: Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM

arxiv url: http://arxiv.org/abs/2604.16368v2
Date: Tue, 21 Apr 2026 17:25:47 GMT
Status: translation complete
System update date: 2026-05-04 02:32:13.935504
Title: Cross-Family Speculative Decoding for Polish Language Models on Apple Silicon: An Empirical Evaluation of Bielik 11B with UAG-Extended MLX-LM
Title (reference translation): Cross-family speculative decoding for Polish language models on Apple Silicon: an empirical evaluation of Bielik 11B with UAG-extended MLX-LM
Authors: Krzysztof Fonal,
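Abstract summary: We extend the MLX-LM framework with Universal Assisted Generation (UAG) to enable cross-tokenizer speculative decoding on Apple Silicon. Experiments on three Polish-language datasets (Wikipedia, pl_alpaca, synthetic) compare naive and context-aware token translation using draft lengths k of 2, 4, and 6.
Reference score (independently computed attention score): 0.0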
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Speculative decoding accelerates LLM inference by using a small draft model to propose k candidate tokens for a target model to verify. While effective for same-tokenizer pairs on high-bandwidth GPUs, its applicability to cross-family pairs with mismatched tokenizers and consumer-grade unified memory remains underexplored. We extend the MLX-LM framework with Universal Assisted Generation (UAG) to enable cross-tokenizer speculative decoding on Apple Silicon. We evaluate Bielik 11B-Instruct (Mistral-based) as the target model, paired with three draft models: Bielik 1.5B (Qwen-based with custom tokenizer), Qwen2.5-1.5B, and Llama 3.2-1B. Experiments on three Polish-language datasets (Wikipedia, pl_alpaca, synthetic) use draft lengths k in {2, 4, 6} to compare naive and context-aware token translation. Results show: (1) context-aware translation consistently improves acceptance rates across all configurations; (2) the Polish-specialized Bielik 1.5B achieves lower acceptance than general-purpose Qwen2.5 and Llama 3.2 drafters; (3) throughput on Apple Silicon is content-dependent, reaching 1.7x speedup for structured text but failing for varied instructions; and (4) verification cost on unified memory does not amortize as theory predicts because both models are memory-bandwidth bound, making sequential drafting expensive relative to batched verification. We propose a hardware-aware speedup formula and characterize conditions for cross-family speculative decoding on Apple Silicon. This is the first systematic evaluation of cross-family speculative decoding for Polish LLMs and the first empirical study of UAG-based decoding on unified memory architectures.
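Abstract (reference translation): Speculative decoding speeds up LLM inference by having a small draft model propose k candidate tokens that a target model then verifies. The technique is effective for same-tokenizer pairs on high-bandwidth GPUs, but its applicability to cross-family pairs with mismatched tokenizers on consumer-grade unified memory is still underexplored. We extend the MLX-LM framework with Universal Assisted Generation (UAG) to enable cross-tokenizer speculative decoding on Apple Silicon. We evaluate Bielik 11B-Instruct (Mistral-based) as the target model, paired with three draft models: Bielik 1.5B (Qwen-based with a custom tokenizer), Qwen2.5-1.5B, and Llama 3.2-1B. Experiments on three Polish-language datasets (Wikipedia, pl_alpaca, synthetic) use draft lengths k in {2, 4, 6} to compare naive and context-aware token translation. The results show that (1) context-aware translation consistently improves acceptance rates in every configuration; (2) the Polish-specialized Bielik 1.5B achieves lower acceptance rates than the general-purpose Qwen2.5 and Llama 3.2 drafters; (3) throughput on Apple Silicon is content-dependent, reaching a 1.7x speedup on structured text but failing on varied instructions; and (4) verification cost on unified memory does not amortize as theory predicts, because both models are memory-bandwidth bound, which makes sequential drafting expensive relative to batched verification. We propose a hardware-aware speedup formula and characterize the conditions under which cross-family speculative decoding pays off on Apple Silicon. This is the first systematic evaluation of cross-family speculative decoding for Polish LLMs and the first empirical study of UAG-based decoding on unified memory architectures.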
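
The abstract contrasts measured throughput with what "theory predicts". As a rough illustration, the sketch below computes the textbook speculative-decoding estimate, assuming each drafted token is accepted independently with probability `alpha` and each draft pass costs `cost_ratio` target passes. This is not the paper's hardware-aware formula, which the abstract does not state. The `verify_cost` parameter is a hypothetical knob added here to show how a verification pass costing more than one ordinary target pass erodes the gain. All numeric values are illustrative, not results from the paper.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per draft-then-verify step.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha, and that the target always contributes one token
    (a correction or a bonus token) at the end of the accepted prefix.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def theoretical_speedup(alpha: float, k: int, cost_ratio: float,
                        verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain autoregressive decoding.

    cost_ratio:  time of one draft forward pass / one target forward pass.
    verify_cost: time of verifying the k drafted tokens / one target pass.
                 The textbook analysis assumes 1.0; larger values are a
                 hypothetical way to model verification that does not
                 amortize (NOT the paper's formula).
    """
    step_cost = k * cost_ratio + verify_cost
    return expected_tokens_per_step(alpha, k) / step_cost


if __name__ == "__main__":
    # Illustrative values only; the draft lengths match the abstract's k set.
    for k in (2, 4, 6):
        ideal = theoretical_speedup(alpha=0.7, k=k, cost_ratio=0.15)
        costly = theoretical_speedup(alpha=0.7, k=k, cost_ratio=0.15,
                                     verify_cost=1.5)
        print(f"k={k}: ideal={ideal:.2f}x, costly verification={costly:.2f}x")
```

Under these assumptions, increasing k helps only while the extra accepted tokens outweigh the k sequential draft passes. That is the trade-off the abstract reports as unfavorable when both models are memory-bandwidth bound.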


Related papers