Pushing the Limits of ChatGPT on NLP Tasks
- URL: http://arxiv.org/abs/2306.09719v2
- Date: Mon, 9 Oct 2023 15:48:23 GMT
- Title: Pushing the Limits of ChatGPT on NLP Tasks
- Authors: Xiaofei Sun, Linfeng Dong, Xiaoya Li, Zhen Wan, Shuhe Wang, Tianwei
Zhang, Jiwei Li, Fei Cheng, Lingjuan Lyu, Fei Wu, Guoyin Wang
- Abstract summary: Despite the success of ChatGPT, its performance on most NLP tasks is still well below the supervised baselines.
In this work, we look into the causes and find that its subpar performance stems from several factors.
We propose a collection of general modules to address these issues, in an attempt to push the limits of ChatGPT on NLP tasks.
- Score: 79.17291002710517
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the success of ChatGPT, its performance on most NLP tasks is
still well below the supervised baselines. In this work, we looked into the
causes and discovered that its subpar performance stems from the following
factors: (1) the token limit of the prompt does not allow for full utilization
of the supervised datasets; (2) the mismatch between the generative nature of
ChatGPT and NLP tasks; (3) intrinsic pitfalls of LLMs, e.g., hallucination and
overemphasis on certain keywords.
In this work, we propose a collection of general modules to address these
issues, in an attempt to push the limits of ChatGPT on NLP tasks. Our proposed
modules include (1) a one-input-multiple-prompts strategy that employs multiple
prompts for one input to accommodate more demonstrations; (2) using fine-tuned
models for better demonstration retrieval; (3) transforming tasks into formats
better suited to the model's generative nature; (4) employing reasoning
strategies tailored to task-specific complexity; (5) a self-verification
strategy to address the hallucination issue of LLMs; and (6) a paraphrase
strategy to improve the robustness of model predictions.
We conduct experiments on 21 datasets of 10 representative NLP tasks,
including question answering, commonsense reasoning, natural language
inference, sentiment analysis, named entity recognition, entity-relation
extraction, event extraction, dependency parsing, semantic role labeling, and
part-of-speech tagging. Using the proposed ensemble of techniques, we are able
to significantly boost the performance of ChatGPT on the selected NLP tasks,
achieving performance comparable to or better than the supervised baselines,
or even the existing SOTA performance.
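To make the proposed modules concrete, below is a minimal sketch of the one-input-multiple-prompts strategy (module 1), the self-verification strategy (module 5), and the paraphrase strategy (module 6). This is not the authors' released code: the `ask` callback is a hypothetical stand-in for an actual ChatGPT API call, and the prompt templates are illustrative assumptions.

```python
# Minimal sketch of modules (1), (5), and (6); `ask` and the prompt
# templates are illustrative assumptions, not the paper's implementation.
from collections import Counter
from typing import Callable, List

def multi_prompt_classify(
    text: str,
    paraphrases: List[str],        # paraphrases of `text` (module 6)
    demo_groups: List[List[str]],  # demonstrations split across prompts (module 1)
    ask: Callable[[str], str],     # hypothetical ChatGPT query function
) -> str:
    """Query the model once per (demonstration group, input variant) pair,
    then take a majority vote over the predicted labels."""
    votes = []
    for demos in demo_groups:
        header = "\n".join(demos)
        for variant in [text] + paraphrases:
            prompt = f"{header}\nInput: {variant}\nLabel:"
            votes.append(ask(prompt).strip())
    return Counter(votes).most_common(1)[0][0]

def self_verify(text: str, label: str, ask: Callable[[str], str]) -> bool:
    """Self-verification (module 5): ask the model to confirm its own
    prediction; a negative answer flags a possible hallucination."""
    prompt = f'Input: {text}\nIs "{label}" the correct label? Answer yes or no:'
    return ask(prompt).strip().lower().startswith("yes")

# Usage with a dummy model that always answers "positive" / "yes":
dummy = lambda prompt: "positive" if "Label:" in prompt else "yes"
label = multi_prompt_classify(
    text="The movie was great.",
    paraphrases=["I really enjoyed the film."],
    demo_groups=[["Input: Awful plot.\nLabel: negative"],
                 ["Input: Loved it.\nLabel: positive"]],
    ask=dummy,
)
assert label == "positive" and self_verify("The movie was great.", label, dummy)
```

Splitting the retrieved demonstrations across several prompts is what works around the per-prompt token limit, and voting across paraphrases is what buys robustness; both aggregate through the same majority vote.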
Related papers
- RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners [38.30539869264287]
Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks.
However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes.
We introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources.
arXiv Detail & Related papers (2024-03-19T02:34:18Z)
- The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions [114.67699010359637]
We analyze a large-scale collection of real user queries to GPT.
We find that tasks such as "design" and "planning" are prevalent in user interactions but are largely neglected by, or differ from, traditional NLP benchmarks.
arXiv Detail & Related papers (2023-10-19T02:12:17Z)
- A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets [19.521390684403293]
We present a thorough evaluation of ChatGPT's performance on diverse academic datasets.
Specifically, we evaluate ChatGPT across 140 tasks and analyze the 255K responses it generates on these datasets.
arXiv Detail & Related papers (2023-05-29T12:37:21Z)
- ChatGraph: Interpretable Text Classification by Converting ChatGPT Knowledge to Graphs [54.48467003509595]
ChatGPT has shown superior performance in various natural language processing (NLP) tasks.
We propose a novel framework that leverages the power of ChatGPT for specific tasks, such as text classification.
Our method provides a more transparent decision-making process compared with previous text classification methods.
arXiv Detail & Related papers (2023-05-03T19:57:43Z)
- Exploring the Feasibility of ChatGPT for Event Extraction [31.175880361951172]
Event extraction is a fundamental task in natural language processing that involves identifying and extracting information about events mentioned in text.
ChatGPT provides an opportunity to solve language tasks with simple prompts without the need for task-specific datasets and fine-tuning.
We show that ChatGPT has, on average, only 51.04% of the performance of a task-specific model such as EEQA in long-tail and complex scenarios.
arXiv Detail & Related papers (2023-03-07T12:03:58Z)
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 datasets covering 8 common NLP application tasks.
We evaluate the multitask, multilingual, and multimodal aspects of ChatGPT on these datasets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average across 10 different reasoning categories under logical reasoning, non-textual reasoning, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the NLP community.
It is not yet known whether ChatGPT can serve as a generalist model that performs many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
- AdaPrompt: Adaptive Model Training for Prompt-based NLP [77.12071707955889]
We propose AdaPrompt, which adaptively retrieves external data for continual pretraining of PLMs.
Experimental results on five NLP benchmarks show that AdaPrompt can improve over standard PLMs in few-shot settings.
In zero-shot settings, our method outperforms standard prompt-based methods by up to 26.35% relative error reduction.
arXiv Detail & Related papers (2022-02-10T04:04:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.