Fugu-MT Paper Translation (Abstract): A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

Paper overview: A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models

arxiv url: http://arxiv.org/abs/2604.04168v2
Date: Tue, 07 Apr 2026 07:29:28 GMT
Status: Translation complete
Last updated in system: 2026-04-08 12:54:27.255366
Title: A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models
Title (reference translation): A Semi-Automated Annotation Workflow for Paediatric Histopathology Reports Using Small Language Models
Authors: Avish Vijayaraghavan, Jaskaran Singh Kawatra, Sebin Sabu, Jonny Sheldon, Will Poulett, Alex Eze, Daniel Key, John Booth, Shiren Patel, Jonny Pearson, Dan Schofield, Jonathan Hope, Pavithra Rajendran, Neil Sebire
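Abstract summary: This study develops a resource-efficient semi-automated annotation workflow that uses small language models (SLMs) to extract structured information from unstructured EPR data. As a proof-of-concept, the workflow is applied to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. The workflow is developed iteratively with clinical oversight across three meetings: 400 reports from a dataset of 2,111 at Great Ormond Street Hospital are manually annotated as a gold standard, while an automated SLM-based information extraction approach is developed in parallel.
Reference score (independently calculated attention score): 0.0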
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Electronic Patient Record (EPR) systems contain valuable clinical information, but much of it is trapped in unstructured text, limiting its use for research and decision-making. Large language models can extract such information but require substantial computational resources to run locally, and sending sensitive clinical data to cloud-based services, even when deidentified, raises significant patient privacy concerns. In this study, we develop a resource-efficient semi-automated annotation workflow using small language models (SLMs) to extract structured information from unstructured EPR data, focusing on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard, while developing an automated information extraction approach using SLMs. We frame extraction as a Question-Answering task grounded by clinician-guided entity guidelines and few-shot examples, evaluating five instruction-tuned SLMs with a disagreement modelling framework to prioritise reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improved performance by 7-19% over the zero-shot baseline, and few-shot examples by 6-38%, though their benefits do not compound when combined. These results demonstrate that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. Our code is available at https://github.com/gosh-dre/nlp_renal_biopsy.
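Abstract (reference translation): Electronic Patient Record (EPR) systems hold valuable clinical information, but much of it is locked in unstructured text, which limits its use for research and decision-making. Large language models can extract this information, but running them locally demands substantial computational resources, and sending sensitive clinical data to cloud-based services raises significant patient privacy concerns even when the data is deidentified. In this study, we develop a resource-efficient semi-automated annotation workflow that uses small language models (SLMs) to extract structured information from unstructured EPR data, with a focus on paediatric histopathology reports. As a proof-of-concept, we apply the workflow to paediatric renal biopsy reports, a domain chosen for its constrained diagnostic scope and well-defined underlying biology. We develop the workflow iteratively with clinical oversight across three meetings, manually annotating 400 reports from a dataset of 2,111 at Great Ormond Street Hospital as a gold standard while developing an automated SLM-based information extraction approach. We frame extraction as a Question-Answering task grounded in clinician-guided entity guidelines and few-shot examples, and evaluate five instruction-tuned SLMs with a disagreement modelling framework that prioritises reports for clinical review. Gemma 2 2B achieves the highest accuracy at 84.3%, outperforming off-the-shelf models including spaCy (74.3%), BioBERT-SQuAD (62.3%), RoBERTa-SQuAD (59.7%), and GLiNER (60.2%). Entity guidelines improve performance by 7-19% over the zero-shot baseline and few-shot examples by 6-38%, although the benefits do not compound when the two are combined. These results show that SLMs can extract structured information from specialised clinical domains on CPU-only infrastructure with minimal clinician involvement. The code is available at https://github.com/gosh-dre/nlp_renal_biopsy.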
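
The abstract mentions a disagreement modelling framework for prioritising reports for clinical review but does not describe how it works. As a rough illustration of the general idea only, and not the authors' method, the sketch below scores each report by the fraction of entities on which several models give different answers, then lists the most contested reports first. The function names, model names, entity names, and example answers are all hypothetical.

```python
def disagreement_score(answers_by_model):
    """Fraction of entities on which the models do not all agree.

    answers_by_model maps a (hypothetical) model name to a dict of
    entity name -> extracted answer string for a single report.
    """
    # Collect every entity that any model answered.
    entities = sorted({e for answers in answers_by_model.values() for e in answers})
    if not entities:
        return 0.0
    disagreements = 0
    for entity in entities:
        # Normalise case and whitespace so trivial differences do not count.
        distinct = {
            str(answers.get(entity, "")).strip().lower()
            for answers in answers_by_model.values()
        }
        if len(distinct) > 1:
            disagreements += 1
    return disagreements / len(entities)


def prioritise_reports(predictions):
    """Return report ids ordered from most to least model disagreement."""
    scores = {rid: disagreement_score(models) for rid, models in predictions.items()}
    # Ties are broken by report id so the ordering is deterministic.
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


# Made-up example: two models, two entities, three reports.
predictions = {
    "r1": {
        "model_a": {"diagnosis": "IgA nephropathy", "num_glomeruli": "12"},
        "model_b": {"diagnosis": "iga nephropathy ", "num_glomeruli": "12"},
    },
    "r2": {
        "model_a": {"diagnosis": "minimal change disease", "num_glomeruli": "8"},
        "model_b": {"diagnosis": "FSGS", "num_glomeruli": "9"},
    },
    "r3": {
        "model_a": {"diagnosis": "FSGS", "num_glomeruli": "5"},
        "model_b": {"diagnosis": "FSGS", "num_glomeruli": "6"},
    },
}
review_order = prioritise_reports(predictions)
```

Under this simple scheme a reviewer would start with the reports at the head of `review_order`; a real framework would likely weight entities or models differently.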

Related papers