Related papers: Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness

URL: http://arxiv.org/abs/2304.11633v1
Date: Sun, 23 Apr 2023 12:33:18 GMT
Title: Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness
Authors: Bo Li, Gexiang Fang, Yang Yang, Quansen Wang, Wei Ye, Wen Zhao, Shikun Zhang
Abstract summary: We focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks. ChatGPT's performance in Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting. ChatGPT provides high-quality and trustworthy explanations for its decisions.
Score: 18.945934162722466
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The capability of Large Language Models (LLMs) like ChatGPT to comprehend user intent and provide reasonable responses has made them extremely popular lately. In this paper, we focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks. Specially, we present the systematically analysis by measuring ChatGPT's performance, explainability, calibration, and faithfulness, and resulting in 15 keys from either the ChatGPT or domain experts. Our findings reveal that ChatGPT's performance in Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting, as evidenced by human evaluation. In addition, our research indicates that ChatGPT provides high-quality and trustworthy explanations for its decisions. However, there is an issue of ChatGPT being overconfident in its predictions, which resulting in low calibration. Furthermore, ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. We manually annotate and release the test sets of 7 fine-grained IE tasks contains 14 datasets to further promote the research. The datasets and code are available at https://github.com/pkuserc/ChatGPT_for_IE.

Related papers

Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond [3.615835506868351]
We focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets.<n>We present the systematically analysis by measuring ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty.
arXiv Detail & Related papers (2026-01-29T14:16:51Z)
Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version) [26.643834593780007]
We investigate the extent to which ChatGPT can annotate data for social computing tasks. ChatGPT exhibits promise in handling data annotation tasks, albeit with some challenges. We propose GPT-Rater, a tool to predict if ChatGPT can correctly label data for a given annotation task.
arXiv Detail & Related papers (2024-07-08T22:04:30Z)
Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples. One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports. Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z)
Primacy Effect of ChatGPT [69.49920102917598]
We study the primacy effect of ChatGPT: the tendency of selecting the labels at earlier positions as the answer. We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions.
arXiv Detail & Related papers (2023-10-20T00:37:28Z)
A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets [19.521390684403293]
We present a thorough evaluation of ChatGPT's performance on diverse academic datasets. Specifically, we evaluate ChatGPT across 140 tasks and analyze 255K responses it generates in these datasets.
arXiv Detail & Related papers (2023-05-29T12:37:21Z)
ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired an infrastructural status. Existing benchmarks encounter two challenges: (1) Disregard for periodical evaluation and (2) Lack of fine-grained features. We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March, 2023 to now.
arXiv Detail & Related papers (2023-04-27T11:33:48Z)
Exploring ChatGPT's Ability to Rank Content: A Preliminary Study on Consistency with Human Preferences [6.821378903525802]
ChatGPT has consistently demonstrated a remarkable level of accuracy and reliability in terms of content evaluation. A test set consisting of prompts is created, covering a wide range of use cases, and five models are utilized to generate corresponding responses. Results on the test set show that ChatGPT's ranking preferences are consistent with human to a certain extent.
arXiv Detail & Related papers (2023-03-14T03:13:02Z)
Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric. Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments. We hope our preliminary study could prompt the emergence of a general-purposed reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z)
Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries. We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models. We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance compared with BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community. It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.