Related papers: Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond

URL: http://arxiv.org/abs/2601.21767v1
Date: Thu, 29 Jan 2026 14:16:51 GMT
Title: Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond
Authors: Wei Zhu,
Abstract summary: We focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty.
Score: 3.615835506868351
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis by measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT's performance scores on MedIE tasks fall behind those of the fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, it is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) The uncertainty in generation causes uncertainty in information extraction results, and thus may hinder its applications in MedIE tasks.

Related papers

A Comparison of Human and ChatGPT Classification Performance on Complex Social Media Data [7.492722530877262]
We measure the performance of GPT-4 on one task and compare results to human annotators. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.
arXiv Detail & Related papers (2025-11-29T23:59:58Z)
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets [19.521390684403293]
We present a thorough evaluation of ChatGPT's performance on diverse academic datasets. Specifically, we evaluate ChatGPT across 140 tasks and analyze 255K responses it generates in these datasets.
arXiv Detail & Related papers (2023-05-29T12:37:21Z)
Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks. We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)
Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness [18.945934162722466]
We focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks. ChatGPT's performance in Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting. ChatGPT provides high-quality and trustworthy explanations for its decisions.
arXiv Detail & Related papers (2023-04-23T12:33:18Z)
To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection. We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains. Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
Towards Making the Most of ChatGPT for Machine Translation [75.576405098545]
ChatGPT shows remarkable capabilities for machine translation (MT). Several prior studies have shown that it achieves comparable results to commercial systems for high-resource languages.
arXiv Detail & Related papers (2023-03-24T03:35:21Z)
Exploring ChatGPT's Ability to Rank Content: A Preliminary Study on Consistency with Human Preferences [6.821378903525802]
ChatGPT has consistently demonstrated a remarkable level of accuracy and reliability in terms of content evaluation. A test set consisting of prompts is created, covering a wide range of use cases, and five models are utilized to generate corresponding responses. Results on the test set show that ChatGPT's ranking preferences are consistent with those of humans to a certain extent.
arXiv Detail & Related papers (2023-03-14T03:13:02Z)
Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric. Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments. We hope our preliminary study could prompt the emergence of a general-purpose reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
On the Robustness of ChatGPT: An Adversarial and Out-of-distribution Perspective [67.98821225810204]
We evaluate the robustness of ChatGPT from the adversarial and out-of-distribution perspective. Results show consistent advantages on most adversarial and OOD classification and translation tasks. ChatGPT shows astounding performance in understanding dialogue-related texts.
arXiv Detail & Related papers (2023-02-22T11:01:20Z)
Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries. We evaluate ChatGPT's understanding ability by testing it on the most popular GLUE benchmark and comparing it with 4 representative fine-tuned BERT-style models. We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community. It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.