Fugu-MT Paper Translation (Abstract): Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

Paper overview: Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

arxiv url: http://arxiv.org/abs/2512.07019v1
Date: Sun, 07 Dec 2025 22:06:51 GMT
Status: Translation complete
Last updated in system: 2025-12-09 22:03:54.641571
Title: Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length
Title (reference translation): Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length
Authors: Zhiyu Xu, Jia Liu, Yixin Wang, Yuqi Gu
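Abstract summary: This paper proposes the Latency-Response Theory (LaRT) model, which jointly models response accuracy and CoT length. It demonstrates LaRT's advantages over IRT in latent trait estimation, in terms of superior estimation accuracy and shorter confidence intervals. LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics, including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency.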
Reference score (independently computed attention score): 31.900167741342354
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods to provide guidance for both downstream applications and actionable future improvements. The Item Response Theory (IRT) model with Computerized Adaptive Testing has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain of thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage the CoT length information to assist the evaluation of LLMs, we propose the Latency-Response Theory (LaRT) model, which jointly models both the response accuracy and CoT length by introducing a key correlation parameter between the latent ability and the latent speed. We derive an efficient stochastic approximation Expectation-Maximization algorithm for parameter estimation. We establish rigorous identifiability results for the latent ability and latent speed parameters to ensure the statistical validity of their estimation. Through both theoretical asymptotic analyses and simulation studies, we demonstrate LaRT's advantages over IRT in terms of superior estimation accuracy and shorter confidence intervals for latent trait estimation. To evaluate LaRT in real data, we collect responses from diverse LLMs on popular benchmark datasets. We find that LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.
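Abstract (reference translation): The spread of Large Language Models (LLMs) calls for valid evaluation methods that provide guidance for both downstream applications and actionable future improvements. The Item Response Theory (IRT) model with Computerized Adaptive Testing has recently emerged as a promising framework for evaluating LLMs by their response accuracy. Beyond simple response accuracy, the length of an LLM's chain of thought (CoT) is an important indicator of its reasoning ability. To use CoT length information in LLM evaluation, the authors propose the Latency-Response Theory (LaRT) model, which jointly models response accuracy and CoT length by introducing a key correlation parameter between the latent ability and the latent speed. They derive an efficient stochastic approximation Expectation-Maximization algorithm for parameter estimation. They establish rigorous identifiability results for the latent ability and latent speed parameters to ensure the statistical validity of their estimation. Through both theoretical asymptotic analyses and simulation studies, they demonstrate LaRT's advantages over IRT in terms of superior estimation accuracy and shorter confidence intervals for latent trait estimation. To evaluate LaRT on real data, they collect responses from diverse LLMs on popular benchmark datasets. LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics, including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.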
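Illustrative sketch (not the authors' code): the abstract does not give the LaRT likelihood, so the following Python simulation only illustrates the general idea of jointly modeling accuracy and CoT length through correlated latent ability and latent speed. Everything specific here is an assumption made for illustration: a two-parameter logistic model for correctness, a lognormal model for CoT length, a bivariate standard normal prior with correlation `rho`, the sign convention that higher speed means shorter CoT, and all function and parameter names. For the actual model, see the repository linked in the abstract.

```python
import math
import random


def simulate_joint(n_models=2000, n_items=20, rho=0.6, seed=0):
    """Simulate correctness and log CoT length for hypothetical LLMs.

    Assumed (not taken from the paper): 2PL accuracy model, lognormal
    length model, and (ability, speed) ~ bivariate normal with correlation rho.
    """
    rng = random.Random(seed)
    # Hypothetical item parameters: discrimination a, difficulty b,
    # and beta, the baseline log CoT length the item demands.
    items = [
        (rng.uniform(0.8, 1.6), rng.gauss(0.0, 1.0), rng.gauss(6.0, 0.3))
        for _ in range(n_items)
    ]
    data = []
    for _ in range(n_models):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        theta = z1                                         # latent ability
        tau = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2    # latent speed
        correct, log_len = [], []
        for a, b, beta in items:
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))   # 2PL success probability
            correct.append(1 if rng.random() < p else 0)
            # Assumed convention: higher speed gives a shorter chain of thought.
            log_len.append(beta - tau + rng.gauss(0.0, 0.5))
        data.append((theta, tau, correct, log_len))
    return data


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    sim = simulate_joint()
    scores = [sum(c) for _, _, c, _ in sim]
    mean_log_len = [sum(l) / len(l) for _, _, _, l in sim]
    # With rho != 0, observed CoT length is correlated with observed accuracy,
    # so length data carries information about ability.
    print("corr(score, mean log length):", pearson(scores, mean_log_len))
```

Under these assumptions, a nonzero `rho` makes the observed CoT lengths informative about ability, which is the intuition behind using length alongside accuracy; with `rho = 0` the two data sources would be independent and the length data would add nothing to ability estimation.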


Related papers