Is Information Extraction Solved by ChatGPT? An Analysis of Performance,
Evaluation Criteria, Robustness and Errors
- URL: http://arxiv.org/abs/2305.14450v1
- Date: Tue, 23 May 2023 18:17:43 GMT
- Title: Is Information Extraction Solved by ChatGPT? An Analysis of Performance,
Evaluation Criteria, Robustness and Errors
- Authors: Ridong Han, Tao Peng, Chaohao Yang, Benyou Wang, Lu Liu, Xiang Wan
- Abstract summary: We first evaluate ChatGPT's performance on 17 datasets with 14 IE sub-tasks under the zero-shot, few-shot and chain-of-thought scenarios.
We then analyze the robustness of ChatGPT on 14 IE sub-tasks, and find that: 1) ChatGPT rarely outputs invalid responses; 2) Irrelevant context and long-tail target types greatly affect ChatGPT's performance; and 3) ChatGPT cannot understand well the subject-object relationships in RE task.
- Score: 14.911130381374793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ChatGPT has stimulated the research boom in the field of large language
models. In this paper, we assess the capabilities of ChatGPT from four
perspectives including Performance, Evaluation Criteria, Robustness and Error
Types. Specifically, we first evaluate ChatGPT's performance on 17 datasets
with 14 IE sub-tasks under the zero-shot, few-shot and chain-of-thought
scenarios, and find a huge performance gap between ChatGPT and SOTA results.
Next, we rethink this gap and propose a soft-matching strategy for evaluation
to more accurately reflect ChatGPT's performance. Then, we analyze the
robustness of ChatGPT on 14 IE sub-tasks, and find that: 1) ChatGPT rarely
outputs invalid responses; 2) Irrelevant context and long-tail target types
greatly affect ChatGPT's performance; 3) ChatGPT cannot understand well the
subject-object relationships in RE task. Finally, we analyze the errors of
ChatGPT, and find that "unannotated spans" is the most dominant error type.
This raises concerns about the quality of annotated data, and indicates the
possibility of annotating data with ChatGPT. The data and code are released at
Github site.
Related papers
- Exploring the Capability of ChatGPT to Reproduce Human Labels for Social Computing Tasks (Extended Version) [26.643834593780007]
We investigate the extent to which ChatGPT can annotate data for social computing tasks.
ChatGPT exhibits promise in handling data annotation tasks, albeit with some challenges.
We propose GPT-Rater, a tool to predict if ChatGPT can correctly label data for a given annotation task.
arXiv Detail & Related papers (2024-07-08T22:04:30Z) - Exploring ChatGPT's Capabilities on Vulnerability Management [56.4403395100589]
We explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples.
One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports.
Our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions.
arXiv Detail & Related papers (2023-11-11T11:01:13Z) - ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time [54.18651663847874]
ChatGPT has achieved great success and can be considered to have acquired an infrastructural status.
Existing benchmarks encounter two challenges: (1) Disregard for periodical evaluation and (2) Lack of fine-grained features.
We construct ChatLog, an ever-updating dataset with large-scale records of diverse long-form ChatGPT responses for 21 NLP benchmarks from March, 2023 to now.
arXiv Detail & Related papers (2023-04-27T11:33:48Z) - Evaluating ChatGPT's Information Extraction Capabilities: An Assessment
of Performance, Explainability, Calibration, and Faithfulness [18.945934162722466]
We focus on assessing the overall ability of ChatGPT using 7 fine-grained information extraction (IE) tasks.
ChatGPT's performance in Standard-IE setting is poor, but it surprisingly exhibits excellent performance in the OpenIE setting.
ChatGPT provides high-quality and trustworthy explanations for its decisions.
arXiv Detail & Related papers (2023-04-23T12:33:18Z) - A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability [57.71052396828714]
This paper presents the first comprehensive analysis of ChatGPT's Text-to- abilities.
We conducted experiments on 12 benchmark datasets with different languages, settings, or scenarios.
Although there is still a gap from the current state-of-the-art (SOTA) model performance, ChatGPT's performance is still impressive.
arXiv Detail & Related papers (2023-03-12T04:22:01Z) - Is ChatGPT a Good NLG Evaluator? A Preliminary Study [121.77986688862302]
We provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric.
Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments.
We hope our preliminary study could prompt the emergence of a general-purposed reliable NLG metric.
arXiv Detail & Related papers (2023-03-07T16:57:20Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and
Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves comparable performance compared with BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z) - Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.