Fugu-MT Paper Translation (Abstract): Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

Paper overview: Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length

arxiv url: http://arxiv.org/abs/2512.07019v2
Date: Thu, 11 Dec 2025 02:45:56 GMT
Status: Translation complete
Last updated in system: 2025-12-12 14:11:15.184448
Title: Latency-Response Theory Model: Evaluating Large Language Models via Response Accuracy and Chain-of-Thought Length
Authors: Zhiyu Xu, Jia Liu, Yixin Wang, Yuqi Gu
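Abstract summary: The paper proposes Latency-Response Theory (LaRT), which jointly models response accuracy and CoT length by introducing a latent ability, a latent speed, and a key correlation parameter between them. LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics, including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency.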
Reference score (independently computed attention score): 31.900167741342354
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The proliferation of Large Language Models (LLMs) necessitates valid evaluation methods to guide downstream applications and actionable future improvements. The Item Response Theory (IRT) has recently emerged as a promising framework for evaluating LLMs via their response accuracy. Beyond simple response accuracy, LLMs' chain of thought (CoT) lengths serve as a vital indicator of their reasoning ability. To leverage the CoT length information to assist the evaluation of LLMs, we propose Latency-Response Theory (LaRT) to jointly model the response accuracy and CoT length by introducing the latent ability, latent speed, and a key correlation parameter between them. We derive an efficient estimation algorithm and establish rigorous identifiability results for the population parameters to ensure the statistical validity of estimation. Theoretical asymptotic analyses and simulation studies demonstrate LaRT's advantages over IRT in terms of higher estimation accuracy and shorter confidence intervals for latent traits. A key finding is that the asymptotic estimation precision of the latent ability under LaRT exceeds that of IRT whenever the latent ability and latent speed are correlated. We collect real responses from diverse LLMs on popular benchmark datasets. The application of LaRT reveals a strong negative correlation between the latent ability and latent speed in all benchmarks, with stronger correlation for more difficult benchmarks. This finding supports the intuition that higher reasoning ability correlates with slower speed and longer response latency. LaRT yields different LLM rankings than IRT and outperforms IRT across multiple key evaluation metrics including predictive power, item efficiency, ranking validity, and LLM evaluation efficiency. Code and data are available at https://github.com/Toby-X/Latency-Response-Theory-Model.
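The joint modeling idea in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' specification: the abstract does not give the model equations, so the code below assumes a Rasch-style logistic model for accuracy, a log-normal model for CoT length, and a bivariate normal prior over (ability, speed) with correlation `rho`. The function names and all parameter values (`simulate_joint`, `conditional_variance`, `sigma`, the item parameter distributions) are hypothetical.

```python
import numpy as np


def simulate_joint(n_models=2000, n_items=30, rho=-0.6, sigma=0.5, seed=0):
    """Simulate accuracy and log CoT length from an ASSUMED joint model.

    Assumptions (not taken from the paper):
      - (ability, speed) ~ bivariate normal, unit variances, correlation rho
      - P(correct) = logistic(ability - difficulty)
      - log CoT length = item_intensity - speed + Gaussian noise
    """
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    traits = rng.multivariate_normal(np.zeros(2), cov, size=n_models)
    ability, speed = traits[:, 0], traits[:, 1]

    difficulty = rng.normal(0.0, 1.0, size=n_items)  # hypothetical item difficulty
    intensity = rng.normal(5.0, 0.3, size=n_items)   # hypothetical item log-length level

    # Accuracy: Bernoulli draws with a logistic success probability.
    p_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
    correct = rng.random((n_models, n_items)) < p_correct

    # CoT length: faster models (higher speed) produce shorter chains.
    noise = sigma * rng.normal(size=(n_models, n_items))
    log_len = intensity[None, :] - speed[:, None] + noise
    return ability, speed, correct, log_len


def conditional_variance(rho):
    """Variance of ability given speed under the unit-variance bivariate normal prior."""
    return 1.0 - rho ** 2


ability, speed, correct, log_len = simulate_joint()
print("sample corr(ability, speed):", np.corrcoef(ability, speed)[0, 1])
print("Var(ability | speed) for rho = -0.6:", conditional_variance(-0.6))
```

Under these assumptions, knowing the speed exactly would shrink the prior variance of ability from 1 to 1 - rho^2, which is smaller than 1 whenever rho is nonzero. This is consistent with the abstract's claim that LaRT gains precision over IRT whenever ability and speed are correlated, though the paper's own asymptotic result may take a different form.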


Related papers