How is ChatGPT's behavior changing over time?
- URL: http://arxiv.org/abs/2307.09009v3
- Date: Tue, 31 Oct 2023 16:13:44 GMT
- Title: How is ChatGPT's behavior changing over time?
- Authors: Lingjiao Chen and Matei Zaharia and James Zou
- Abstract summary: We evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4.
We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
- Score: 72.79311931941876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPT-3.5 and GPT-4 are the two most widely used large language model (LLM)
services. However, when and how these models are updated over time is opaque.
Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on
several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3)
opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating
code, 6) US Medical License tests, and 7) visual reasoning. We find that the
performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
For example, GPT-4 (March 2023) was reasonable at identifying prime vs.
composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same
questions (51% accuracy). This is partly explained by a drop in GPT-4's amenity
to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in
June than in March in this task. GPT-4 became less willing to answer sensitive
questions and opinion survey questions in June than in March. GPT-4 performed
better at multi-hop questions in June than in March, while GPT-3.5's
performance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting
mistakes in code generation in June than in March. We provide evidence that
GPT-4's ability to follow user instructions has decreased over time, which is
one common factor behind the many behavior drifts. Overall, our findings show
that the behavior of the "same" LLM service can change substantially in a
relatively short amount of time, highlighting the need for continuous
monitoring of LLMs.
Related papers
- Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions [2.0411082897313984]
This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math.
By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated'student' model.
arXiv Detail & Related papers (2024-06-20T00:25:43Z) - Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [65.84623493488633]
This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks.
The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o.
Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
arXiv Detail & Related papers (2024-06-10T14:18:56Z) - Unveiling Divergent Inductive Biases of LLMs on Temporal Data [4.561800294155325]
This research focuses on evaluating the performance of GPT-3.5 and GPT-4 models in the analysis of temporal data.
biases toward specific temporal relationships come to light, with GPT-3.5 demonstrating a preference for "AFTER'' in the QA format for both implicit and explicit events, while GPT-4 leans towards "BEFORE''
arXiv Detail & Related papers (2024-04-01T19:56:41Z) - Behind the Screen: Investigating ChatGPT's Dark Personality Traits and
Conspiracy Beliefs [0.0]
This paper analyzes the dark personality traits and conspiracy beliefs of GPT-3.5 and GPT-4.
Dark personality traits and conspiracy beliefs were not particularly pronounced in either model.
arXiv Detail & Related papers (2024-02-06T16:03:57Z) - GPT-4 Can't Reason [6.040938686276303]
GPT-4 was released in March 2023 to wide acclaim.
Despite its occasional flashes of analytical brilliance, GPT-4 at present is utterly incapable of reasoning.
arXiv Detail & Related papers (2023-07-21T17:04:25Z) - DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z) - Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z) - Gpt-4: A Review on Advancements and Opportunities in Natural Language
Processing [0.0]
Generative Pre-trained Transformer 4 (GPT-4) is the fourth-generation language model in the GPT series, developed by OpenAI.
GPT-4 has a larger model size (more than one trillion), better multilingual capabilities, improved contextual understanding, and reasoning capabilities than GPT-3.
Some of the potential applications of GPT-4 include chatbots, personal assistants, language translation, text summarization, and question-answering.
arXiv Detail & Related papers (2023-05-04T22:46:43Z) - Instruction Tuning with GPT-4 [107.55078894215798]
We present the first attempt to use GPT-4 to generate instruction-following data for finetuning large language models.
Our early experiments on instruction-tuned LLaMA models show that the 52K English and Chinese instruction-following data generated by GPT-4 leads to superior zero-shot performance on new tasks.
arXiv Detail & Related papers (2023-04-06T17:58:09Z) - Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.