A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models
- URL: http://arxiv.org/abs/2303.10420v2
- Date: Sat, 23 Dec 2023 12:53:02 GMT
- Title: A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models
- Authors: Junjie Ye, Xuanting Chen, Nuo Xu, Can Zu, Zekai Shao, Shichun Liu,
Yuhan Cui, Zeyang Zhou, Chao Gong, Yang Shen, Jie Zhou, Siming Chen, Tao Gui,
Qi Zhang, Xuanjing Huang
- Abstract summary: GPT series models have gained considerable attention due to their exceptional natural language processing capabilities.
We select six representative models, comprising two GPT-3 series models and four GPT-3.5 series models.
We evaluate their performance on nine natural language understanding (NLU) tasks using 21 datasets.
Our experiments reveal that the overall ability of GPT series models on NLU tasks does not increase gradually as the models evolve.
- Score: 71.42197262495056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPT series models, such as GPT-3, CodeX, InstructGPT, ChatGPT, and so on,
have gained considerable attention due to their exceptional natural language
processing capabilities. However, despite the abundance of research on the
difference in capabilities between GPT series models and fine-tuned models,
there has been limited attention given to the evolution of GPT series models'
capabilities over time. To conduct a comprehensive analysis of the capabilities
of GPT series models, we select six representative models, comprising two GPT-3
series models (i.e., davinci and text-davinci-001) and four GPT-3.5 series
models (i.e., code-davinci-002, text-davinci-002, text-davinci-003, and
gpt-3.5-turbo). We evaluate their performance on nine natural language
understanding (NLU) tasks using 21 datasets. In particular, we compare the
performance and robustness of different models for each task under zero-shot
and few-shot scenarios. Our extensive experiments reveal that the overall
ability of GPT series models on NLU tasks does not increase gradually as the
models evolve, especially with the introduction of the RLHF training strategy.
While this strategy enhances the models' ability to generate human-like
responses, it also compromises their ability to solve some tasks. Furthermore,
our findings indicate that there is still room for improvement in areas such as
model robustness.
Related papers
- Large language models for aspect-based sentiment analysis [0.0]
We assess the performance of GPT-4 and GPT-3.5 in zero shot, few shot and fine-tuned settings.
Fine-tuned GPT-3.5 achieves a state-of-the-art F1 score of 83.8 on the joint aspect term extraction and polarity classification task.
arXiv Detail & Related papers (2023-10-27T10:03:21Z) - Black-Box Analysis: GPTs Across Time in Legal Textual Entailment Task [17.25356594832692]
We present an analysis of GPT-3.5 (ChatGPT) and GPT-4 performances on COLIEE Task 4 dataset.
Our preliminary experimental results unveil intriguing insights into the models' strengths and weaknesses in handling legal textual entailment tasks.
arXiv Detail & Related papers (2023-09-11T14:43:54Z) - InheritSumm: A General, Versatile and Compact Summarizer by Distilling
from GPT [75.29359361404073]
InheritSumm is a versatile and compact summarization model derived from GPT-3.5 through distillation.
It achieves similar or superior performance to GPT-3.5 in zeroshot and fewshot settings.
arXiv Detail & Related papers (2023-05-22T14:52:32Z) - GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot
Setting and Performance Boosting Through Prompts [0.0]
Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks.
In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks.
arXiv Detail & Related papers (2023-05-21T14:45:17Z) - Exploring the Trade-Offs: Unified Large Language Models vs Local
Fine-Tuned Models for Highly-Specific Radiology NLI Task [49.50140712943701]
We evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples.
We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty.
arXiv Detail & Related papers (2023-04-18T17:21:48Z) - How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language
Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z) - Emergent Analogical Reasoning in Large Language Models [1.5469452301122177]
We show that GPT-3 has a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings.
Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.
arXiv Detail & Related papers (2022-12-19T00:04:56Z) - News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z) - Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.