Generative AI Takes a Statistics Exam: A Comparison of Performance between ChatGPT3.5, ChatGPT4, and ChatGPT4o-mini
- URL: http://arxiv.org/abs/2501.09171v1
- Date: Wed, 15 Jan 2025 21:46:01 GMT
- Title: Generative AI Takes a Statistics Exam: A Comparison of Performance between ChatGPT3.5, ChatGPT4, and ChatGPT4o-mini
- Authors: Monnie McGee, Bivin Sadler,
- Abstract summary: We investigate the performance of GPT versions 3.5, 4.0, and 4o-mini on the same 16-question statistics exam given to a class of first-year graduate students.
Results indicate that GPT3.5 and 4o-mini have characteristics that are more similar than either of them have with GPT4.
- Score: 0.0
- License:
- Abstract: Many believe that use of generative AI as a private tutor has the potential to shrink access and achievement gaps between students and schools with abundant resources versus those with fewer resources. Shrinking the gap is possible only if paid and free versions of the platforms perform with the same accuracy. In this experiment, we investigate the performance of GPT versions 3.5, 4.0, and 4o-mini on the same 16-question statistics exam given to a class of first-year graduate students. While we do not advocate using any generative AI platform to complete an exam, the use of exam questions allows us to explore aspects of ChatGPT's responses to typical questions that students might encounter in a statistics course. Results on accuracy indicate that GPT 3.5 would fail the exam, GPT4 would perform well, and GPT4o-mini would perform somewhere in between. While we acknowledge the existence of other Generative AI/LLMs, our discussion concerns only ChatGPT because it is the most widely used platform on college campuses at this time. We further investigate differences among the AI platforms in the answers for each problem using methods developed for text analytics, such as reading level evaluation and topic modeling. Results indicate that GPT3.5 and 4o-mini have characteristics that are more similar than either of them have with GPT4.
Related papers
- Equity in the Use of ChatGPT for the Classroom: A Comparison of the Accuracy and Precision of ChatGPT 3.5 vs. ChatGPT4 with Respect to Statistics and Data Science Exams [0.0]
ChatGPT is the most popular of these platforms, and it has a paid version (ChatGPT4) and a free version (ChatGPT3.5)
We determine the extent to which students who cannot afford newer and faster versions of generative AI programs would be disadvantaged in terms of writing such projects and learning these methods.
arXiv Detail & Related papers (2024-12-17T17:38:13Z) - Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams [48.99818550820575]
We leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to automatically grade handwritten responses to college-level math exams.
Using real student responses to questions in a probability theory exam, we evaluate GPT-4o's alignment with ground-truth scores from human graders using various prompting techniques.
arXiv Detail & Related papers (2024-11-07T22:51:47Z) - How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO [55.25989137825992]
We introduce ECHO, an evaluative framework inspired by the Turing test.
This framework engages the acquaintances of the target individuals to distinguish between human and machine-generated responses.
We evaluate three role-playing LLMs using ECHO, with GPT-3.5 and GPT-4 serving as foundational models.
arXiv Detail & Related papers (2024-04-22T08:00:51Z) - Constrained C-Test Generation via Mixed-Integer Programming [55.28927994487036]
This work proposes a novel method to generate C-Tests; a form of cloze tests (a gap filling exercise) where only the last part of a word is turned into a gap.
In contrast to previous works that only consider varying the gap size or gap placement to achieve locally optimal solutions, we propose a mixed-integer programming (MIP) approach.
We publish our code, model, and collected data consisting of 32 English C-Tests with 20 gaps each (totaling 3,200 individual gap responses) under an open source license.
arXiv Detail & Related papers (2024-04-12T21:35:21Z) - Text Understanding in GPT-4 vs Humans [2.024925013349319]
We examine whether a leading AI system GPT4 understands text as well as humans do.
We first use a well-established standardized test of discourse comprehension.
Next, we use more difficult passages to determine whether that could allow larger differences between GPT4 and humans.
arXiv Detail & Related papers (2024-03-25T21:17:14Z) - How is ChatGPT's behavior changing over time? [72.79311931941876]
We evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4.
We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time.
arXiv Detail & Related papers (2023-07-18T06:56:08Z) - ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking
about [15.19126287569545]
This research examines the responses generated by ChatGPT from different Conversational QA corpora.
The study employed BERT similarity scores to compare these responses with correct answers and obtain Natural Language Inference(NLI) labels.
The study identified instances where ChatGPT provided incorrect answers to questions, providing insights into areas where the model may be prone to error.
arXiv Detail & Related papers (2023-04-06T18:42:47Z) - Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z) - A Complete Survey on Generative AI (AIGC): Is ChatGPT from GPT-4 to
GPT-5 All You Need? [112.12974778019304]
generative AI (AIGC, a.k.a AI-generated content) has made headlines everywhere because of its ability to analyze and create text, images, and beyond.
In the era of AI transitioning from pure analysis to creation, it is worth noting that ChatGPT, with its most recent language model GPT-4, is just a tool out of numerous AIGC tasks.
This work focuses on the technological development of various AIGC tasks based on their output type, including text, images, videos, 3D content, etc.
arXiv Detail & Related papers (2023-03-21T10:09:47Z) - ChatGPT: Jack of all trades, master of none [4.693597927153063]
OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT)
We examined ChatGPT's capabilities on 25 diverse analytical NLP tasks.
We automated ChatGPT and GPT-4 prompting process and analyzed more than 49k responses.
arXiv Detail & Related papers (2023-02-21T15:20:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.