Fugu-MT paper translation (abstract): Reliable Curation of EHR Dataset via Large Language Models under Environmental Constraints

Paper summary: Reliable Curation of EHR Dataset via Large Language Models under Environmental Constraints

arxiv url: http://arxiv.org/abs/2511.00772v1
Date: Sun, 02 Nov 2025 02:45:54 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:26.926041
Title: Reliable Curation of EHR Dataset via Large Language Models under Environmental Constraints
Authors: Raymond M. Xiong, Panyu Chen, Tianze Dong, Jian Lu, Benjamin Goldstein, Danyang Zhuo, Anru R. Zhang,
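Abstract summary: CELEC is a large language model (LLM)-powered framework for automated EHR data extraction and analytics. On a subset of the EHRSQL benchmark, CELEC achieves execution accuracy comparable to prior systems while maintaining low latency, cost efficiency, and strict privacy.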
Reference score (independently calculated attention level): 11.502074619844125
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Electronic health records (EHRs) are central to modern healthcare delivery and research; yet, many researchers lack the database expertise necessary to write complex SQL queries or generate effective visualizations, limiting efficient data use and scientific discovery. To address this barrier, we introduce CELEC, a large language model (LLM)-powered framework for automated EHR data extraction and analytics. CELEC translates natural language queries into SQL using a prompting strategy that integrates schema information, few-shot demonstrations, and chain-of-thought reasoning, which together improve accuracy and robustness. On a subset of the EHRSQL benchmark, CELEC achieves execution accuracy comparable to prior systems while maintaining low latency, cost efficiency, and strict privacy by exposing only database metadata to the LLM. CELEC also adheres to strict privacy protocols: the LLM accesses only database metadata (e.g., table and column names), while all query execution occurs securely within the institutional environment, ensuring that no patient-level data is ever transmitted to or shared with the LLM. Ablation studies confirm that each component of the SQL generation pipeline, particularly the few-shot demonstrations, plays a critical role in performance. By lowering technical barriers and enabling medical researchers to query EHR databases directly, CELEC streamlines research workflows and accelerates biomedical discovery.
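The prompting strategy described in the abstract (schema information, few-shot demonstrations, and chain-of-thought reasoning, with only database metadata exposed to the LLM) can be illustrated with a minimal sketch. This is not CELEC's actual implementation: the table and column names, the `Demo` type, the prompt wording, and the `build_prompt` and `extract_sql` helpers are all assumptions made for illustration. The point it shows is that the prompt is assembled from table and column names plus worked examples, so no patient-level rows ever enter it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    """One few-shot demonstration: question, reasoning trace, final SQL."""
    question: str
    reasoning: str
    sql: str


def render_schema(schema: dict[str, list[str]]) -> str:
    # Only metadata (table and column names) is rendered, never row values.
    return "\n".join(f"{table}({', '.join(cols)})" for table, cols in schema.items())


def build_prompt(question: str, schema: dict[str, list[str]], demos: list[Demo]) -> str:
    parts = [
        "Translate the question into a SQL query.",
        "Think step by step, then give the final query after 'SQL:'.",
        "",
        "Schema:",
        render_schema(schema),
        "",
    ]
    for d in demos:  # few-shot demonstrations, each with a reasoning trace
        parts += [f"Question: {d.question}", f"Reasoning: {d.reasoning}", f"SQL: {d.sql}", ""]
    # End on an open "Reasoning:" cue so the model reasons before answering.
    parts += [f"Question: {question}", "Reasoning:"]
    return "\n".join(parts)


def extract_sql(completion: str) -> str:
    # Keep the text after the last 'SQL:' marker; if the marker is absent,
    # the whole completion is returned (stripped).
    _, _, tail = completion.rpartition("SQL:")
    return tail.strip()


# Hypothetical schema and demonstration, invented for this example.
SCHEMA = {
    "patients": ["subject_id", "gender"],
    "admissions": ["subject_id", "admittime"],
}
DEMOS = [
    Demo(
        question="How many patients are there?",
        reasoning="Each row of patients is one patient, so count the rows.",
        sql="SELECT COUNT(*) FROM patients;",
    )
]

prompt = build_prompt("How many admissions are there?", SCHEMA, DEMOS)
print(prompt)
```

In a setup like the one the abstract describes, the query returned by `extract_sql` would then be executed inside the institutional environment, so query results stay local and are not sent back to the LLM.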


Related papers