Fugu-MT Paper Translation (Abstract): Latent Performance Profiling of Large Language Models

Paper summary: Latent Performance Profiling of Large Language Models

arxiv url: http://arxiv.org/abs/2605.30018v1
Date: Thu, 28 May 2026 14:41:26 GMT
Status: translation complete
Last updated in system: 2026-05-30 02:45:56.399081
Title: Latent Performance Profiling of Large Language Models
Authors: Tanmoy Chakraborty, Ayan Sengupta, Suparna Bhattacharya, Partha Pratim Chakrabarti, Amlan Chakrabarti, Supratik Chakraborty, Partha Pratim Das, Lipika Dey, Richa Singh, Mayank Vatsa
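Abstract summary: Latent Performance Profiling (LPP) is a framework that derives task-agnostic diagnostics from hidden activations and output distributions. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. The authors show that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability.
Reference score (attention measure computed independently by this site): 47.009623327601226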
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) frequently achieve impressive scores on standardized benchmarks, yet accuracy alone offers a limited view of their capabilities. Evaluating open-source LLMs through leaderboards faces persistent issues like data contamination, narrow task scope, and weak alignment with real-world reliability. Benchmark-based evaluations such as MMLU PRO, BBH, or IFEval primarily capture what a model outputs on fixed test sets, not how it processes information, calibrates uncertainty, or structures internal knowledge. In this article, we advocate for a shift from benchmark-centric evaluation toward a complementary, state-centered intrinsic assessment of LLMs. To this end, we introduce Latent Performance Profiling (LPP), a framework that derives task-agnostic diagnostics from hidden activations and output distributions. LPP defines a set of scalar metrics on a model's latent representations and dynamics, revealing scale-independent traits that enable interpretable comparisons and uncover hidden vulnerabilities. Unlike static accuracy scores, LPP provides stable, architecture-sensitive signatures across models of similar size. With extensive empirical analyses across eight LLMs, spanning a size range of 0.5B-14B, we demonstrate that models with similar benchmark scores can exhibit contrasting latent profiles, such as differences in entropy or adaptability. Guided by these insights, we design synthetic probes for uncertainty and symbolic reasoning that align with intrinsic metrics while decoupling from leaderboard bias. We recommend that reporting LPP alongside benchmarks provides a deeper, interpretable understanding of model behavior, enabling more reliable model selection, safety assessment, and evaluation beyond surface-level accuracy.
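
The abstract does not spell out the individual LPP metrics, so the following is only an illustrative sketch and not the authors' definition. It assumes that one diagnostic of the kind mentioned (a difference "in entropy") could be the Shannon entropy of the model's next-token output distribution, averaged over positions. The function names `softmax`, `predictive_entropy`, and `mean_entropy` are hypothetical, and the logits are plain Python lists standing in for real model outputs.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (max-subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive_entropy(logits):
    """Shannon entropy (in nats) of the distribution implied by one logit vector."""
    probs = softmax(logits)
    # Skip exact zeros so that 0 * log(0) is treated as 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(per_token_logits):
    """Average predictive entropy over a sequence of logit vectors.

    Assumption: a single scalar per model is obtained by simple averaging;
    the paper may aggregate differently.
    """
    entropies = [predictive_entropy(row) for row in per_token_logits]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Toy logits over a 4-token vocabulary (hypothetical values).
    uncertain = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, -0.1, 0.0]]
    confident = [[10.0, 0.0, 0.0, 0.0], [0.0, 12.0, 0.0, 0.0]]
    print("uncertain model:", mean_entropy(uncertain))
    print("confident model:", mean_entropy(confident))
```

Two models could pick the same top token at every position, and so score identically on an accuracy benchmark, while producing very different values here, which is the kind of contrast the abstract describes.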
