Fugu-MT Paper Translation (Abstract): DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

Paper overview: DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

arxiv url: http://arxiv.org/abs/2603.23514v1
Date: Thu, 05 Mar 2026 20:49:11 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:12.998533
Title: DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models
Authors: Alexander Sheppert
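Abstract summary: Large language models appear competent when answering general questions but often fail when pushed into domain-specific details. We present DepthCharge, a domain-agnostic framework that measures knowledge depth through three innovations: adaptive probing that generates follow-up questions based on concepts the model actually mentions, on-demand fact verification from authoritative sources, and survival statistics with constant sample sizes at every depth level.
Reference score (independently computed attention score): 51.56484100374058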
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models appear competent when answering general questions but often fail when pushed into domain-specific details. No existing methodology provides an out-of-the-box solution for measuring how deeply LLMs can sustain accurate responses under adaptive follow-up questioning across arbitrary domains. We present DepthCharge, a domain-agnostic framework that measures knowledge depth through three innovations: adaptive probing that generates follow-up questions based on concepts the model actually mentions, on-demand fact verification from authoritative sources, and survival statistics with constant sample sizes at every depth level. The framework can be deployed on any knowledge domain with publicly verifiable facts, without requiring pre-constructed test sets or domain-specific expertise. DepthCharge results are relative to the evaluator model used for answer checking, making the framework a tool for comparative evaluation rather than absolute accuracy certification. Empirical validation across four diverse domains (Medicine, Constitutional Law, Ancient Rome, and Quantum Computing) with five frontier models demonstrates that DepthCharge reveals depth-dependent performance variation hidden by standard benchmarks. Expected Valid Depth (EVD) ranges from 3.45 to 7.55 across model-domain combinations, and model rankings vary substantially by domain, with no single model dominating all areas. Cost-performance analysis further reveals that expensive models do not always achieve deeper knowledge, suggesting that domain-specific evaluation is more informative than aggregate benchmarks for model selection in professional applications.
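
The abstract names survival statistics and Expected Valid Depth (EVD) but does not give a formula. The following is a minimal sketch of one plausible reading, not the paper's implementation. It assumes that each probe chain records the deepest level it answered correctly before its first failure, that every chain stays in the denominator at every depth (one way to keep the sample size constant), and that EVD is the sum of the survival fractions over all depth levels. The function names, the `max_valid_depths` input, and the sample data are hypothetical.

```python
from typing import List, Sequence


def survival_curve(max_valid_depths: Sequence[int], num_levels: int) -> List[float]:
    """Fraction of probe chains still valid at each depth 1..num_levels.

    Assumption: max_valid_depths[i] is the deepest level chain i answered
    correctly before its first failure. Every chain is counted at every
    depth, so the denominator never shrinks.
    """
    n = len(max_valid_depths)
    if n == 0:
        raise ValueError("need at least one probe chain")
    return [
        sum(1 for d in max_valid_depths if d >= level) / n
        for level in range(1, num_levels + 1)
    ]


def expected_valid_depth(max_valid_depths: Sequence[int], num_levels: int) -> float:
    """Assumed EVD: the sum of survival fractions across depth levels.

    This equals the mean of min(depth, num_levels) over all chains.
    """
    return sum(survival_curve(max_valid_depths, num_levels))


if __name__ == "__main__":
    # Hypothetical data: five probe chains, probed to at most 8 levels.
    depths = [3, 5, 8, 2, 6]
    print(survival_curve(depths, 8))
    print(expected_valid_depth(depths, 8))
```

Under these assumptions the EVD of a model-domain pair is simply the average depth a chain reaches before failing, so two pairs can be compared on a single number even when their survival curves differ in shape.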

Related papers