Fugu-MT paper translation (abstract): Luna-2: Scalable Single-Token Evaluation with Small Language Models

Paper overview: Luna-2: Scalable Single-Token Evaluation with Small Language Models

arxiv url: http://arxiv.org/abs/2602.18583v1
Date: Fri, 20 Feb 2026 19:43:58 GMT
Status: Translation complete
Last updated in system: 2026-02-24 17:42:02.175363
Title: Luna-2: Scalable Single-Token Evaluation with Small Language Models
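Title (reference translation): Luna-2: Scalable Single-Token Evaluation with Small Language Models
Authors: Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao, Yash Sheth
Abstract summary: Real-time guardrails require evaluation that is accurate, cheap, and fast. Today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic. This paper proposes Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model.
Reference score (independently computed attention score): 2.256035939593399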
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that leverages decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g. toxicity, hallucination, tool selection quality, etc.) at an accuracy at par or higher than LLMAJ using frontier LLMs while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture, training methodology and report real-world empirical results on accuracy, latency, and throughput results. In production, Luna-2 is protecting 100M+ AI sessions and processing over 100B tokens per month for our customers with eval cost savings of over $30M annually.
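Abstract (reference translation): Real-time guardrails require evaluation that is accurate, cheap, and fast, but today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic because of multi-token generation. This paper proposes Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model. It reliably computes complex task-specific LLMAJ metrics (for example toxicity, hallucination, and tool selection quality) at an accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, so hundreds of specialized metrics can run concurrently on a single GPU and be deployed locally next to AI systems in a privacy-preserving, latency-optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. The paper outlines the model architecture and training methodology, and reports real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 protects over 100 million AI sessions and processes over 100 billion tokens per month, with eval cost savings of over $30M annually.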
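
Note: the abstract does not spell out how a single-token evaluation is computed. The following is a minimal sketch of one common way to do it, not the authors' implementation: read the next-token logits from one forward pass, keep only the logits of two label tokens, and normalize them into a probability. With no sampling step, the same logits always give the same score. The function name, the toy vocabulary, and the token ids are all hypothetical.

```python
import math


def single_token_score(logits, positive_id, negative_id):
    """Return P(positive) from next-token logits, normalized over two label tokens.

    Assumption for illustration: `logits` is a plain list of floats indexed by
    token id, as produced by a single forward pass (no sampling, no decoding loop).
    """
    pos = logits[positive_id]
    neg = logits[negative_id]
    # Subtract the max before exponentiating for numerical stability.
    m = max(pos, neg)
    e_pos = math.exp(pos - m)
    e_neg = math.exp(neg - m)
    return e_pos / (e_pos + e_neg)


# Hypothetical logits over a toy 4-token vocabulary.
logits = [0.1, 2.0, -1.0, 0.5]
score = single_token_score(logits, positive_id=1, negative_id=2)
```

Because the score depends only on two logits from one pass, it can be thresholded directly (for example, flagging a response when the score exceeds a chosen cutoff) instead of parsing generated text.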

Related papers