Fugu-MT Paper Translation (Abstract): Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

arxiv url: http://arxiv.org/abs/2606.16890v1
Date: Mon, 15 Jun 2026 16:00:27 GMT
Status: Translation complete
Last updated in system: 2026-06-16 16:21:34.745943
Title: Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering
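Title (reference translation): Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering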
Authors: Sanjay Basu
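Abstract summary: Accuracy on EHR question-answer pairs declines monotonically with hop count. Hop count is a theory-motivated, cross-architecture predictor of large-language-model error in EHR question answering.
Reference score (attention score computed independently by this site): 0.6768558752130311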
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.
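Abstract (reference translation): Aggregate accuracy benchmarks hide a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions that require more inferential steps produce disproportionately more errors. Motivated by theoretical results on the compositionality limits of transformers, we introduce a pre-specified hop-count taxonomy, the number of distinct reasoning steps needed to answer a clinical question from an EHR, as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and in cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models show a monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o falls from 37.8% to 14.7% (OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 falls from 37.8% to 23.5% (OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows that higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.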
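The trend statistics quoted in the abstract can be illustrated with a minimal sketch. This is not the author's analysis code. The abstract does not give per-hop counts, so the counts below are hypothetical placeholders, and the function names are illustrative assumptions. The sketch shows a Cochran-Armitage trend statistic with linear hop scores, a one-sided normal tail probability, and how a per-hop odds ratio compounds across hops. The abstract does not say whether its p-value is one-sided, but the lower tail for z=-2.30 is about 0.011, which is consistent with the reported value.

```python
import math


def cochran_armitage_z(successes, totals, scores):
    """Cochran-Armitage trend statistic for proportions across ordered groups."""
    n = sum(totals)
    p_bar = sum(successes) / n  # pooled success rate under the no-trend null
    # Score-weighted deviation of observed successes from their null expectation.
    t = sum(s * (x - m * p_bar) for s, x, m in zip(scores, successes, totals))
    # Null variance of t: p(1-p) * [sum(n_i s_i^2) - (sum(n_i s_i))^2 / N].
    s1 = sum(m * s for s, m in zip(scores, totals))
    s2 = sum(m * s * s for s, m in zip(scores, totals))
    var = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n)
    return t / math.sqrt(var)


def lower_tail_p(z):
    """P(Z <= z) for a standard normal: one-sided evidence of a downward trend."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def project_accuracy(p_start, odds_ratio_per_hop, extra_hops):
    """Accuracy implied by multiplying the starting odds by OR once per extra hop."""
    odds = p_start / (1 - p_start) * odds_ratio_per_hop ** extra_hops
    return odds / (1 + odds)


if __name__ == "__main__":
    # HYPOTHETICAL counts of correct answers and questions at hops 1..4.
    # These are placeholders for illustration, not data from the paper.
    correct = [30, 22, 14, 6]
    asked = [100, 90, 70, 40]
    z = cochran_armitage_z(correct, asked, scores=[1, 2, 3, 4])
    print(f"hypothetical trend z = {z:.2f}, one-sided p = {lower_tail_p(z):.3f}")

    # Values reported in the abstract: z = -2.30; 30.6% at hop=1; OR per hop 0.72.
    print(f"lower-tail p for z=-2.30: {lower_tail_p(-2.30):.3f}")
    print(f"projected hop=4 accuracy: {project_accuracy(0.306, 0.72, 3):.3f}")
```

Compounding the reported odds ratio of 0.72 over three additional hops from the observed 30.6% gives roughly 14%, somewhat below the observed 17.6% at hop=4. A gap of this kind is expected, since the odds ratio presumably comes from a regression fitted to all hop levels rather than one anchored at the hop=1 rate.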

