Related papers: Comparative Analysis of ChatGPT, GPT-4, and Microsoft Bing Chatbots for GRE Test

Comparative Analysis of ChatGPT, GPT-4, and Microsoft Bing Chatbots for GRE Test

URL: http://arxiv.org/abs/2312.03719v4
Date: Sat, 6 Apr 2024 18:37:05 GMT
Title: Comparative Analysis of ChatGPT, GPT-4, and Microsoft Bing Chatbots for GRE Test
Authors: Mohammad Abu-Haifa, Bara'a Etawi, Huthaifa Alkhatatbeh, Ayman Ababneh,
Abstract summary: This research paper presents an analysis of how well three artificial intelligence chatbots: Bing, ChatGPT, and GPT-4, perform when answering questions from standardized tests. A total of 137 questions with different forms of quantitative reasoning and 157 questions with verbal categories were used to assess their capabilities.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This research paper presents an analysis of how well three artificial intelligence chatbots: Bing, ChatGPT, and GPT-4, perform when answering questions from standardized tests. The Graduate Record Examination is used in this paper as a case study. A total of 137 questions with different forms of quantitative reasoning and 157 questions with verbal categories were used to assess their capabilities. This paper presents the performance of each chatbot across various skills and styles tested in the exam. The proficiency of these chatbots in addressing image-based questions is also explored, and the uncertainty level of each chatbot is illustrated. The results show varying degrees of success across the chatbots, where GPT-4 served as the most proficient, especially in complex language understanding tasks and image-based questions. Results highlight the ability of these chatbots to pass the GRE with a high score, which encourages the use of these chatbots in test preparation. The results also show how important it is to ensure that, if the test is administered online, as it was during COVID, the test taker is segregated from these resources for a fair competition on higher education opportunities.

Related papers

Der Effizienz- und Intelligenzbegriff in der Lexikographie und kuenstlichen Intelligenz: kann ChatGPT die lexikographische Textsorte nachbilden? [0.0]
This paper examines the concept of efficiency and intelligence in lexicography and artificial intelligence, AI. The aim of the experiments is to gain empirically and statistically based insights into the lexicographical text type,dictionary article.
arXiv Detail & Related papers (2024-12-11T18:18:07Z)
Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations. Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time. This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z)
A Study on the Vulnerability of Test Questions against ChatGPT-based Cheating [14.113742357609285]
ChatGPT can answer text prompts fairly accurately, even performing very well on postgraduate-level questions. Many educators have found that their take-home or remote tests and exams are vulnerable to ChatGPT-based cheating.
arXiv Detail & Related papers (2024-02-21T23:51:06Z)
In Generative AI we Trust: Can Chatbots Effectively Verify Political Information? [39.58317527488534]
This article presents a comparative analysis of the ability of two large language model (LLM)-based chatbots, ChatGPT and Bing Chat, to detect veracity of political information. We use AI auditing methodology to investigate how chatbots evaluate true, false, and borderline statements on five topics: COVID-19, Russian aggression against Ukraine, the Holocaust, climate change, and LGBTQ+ related debates. The results show high performance of ChatGPT for the baseline veracity evaluation task, with 72 percent of the cases evaluated correctly on average across languages without pre-training.
arXiv Detail & Related papers (2023-12-20T15:17:03Z)
ChatGPT Performance on Standardized Testing Exam -- A Proposed Strategy for Learners [0.0]
This study explores the problem solving capabilities of ChatGPT and its prospective applications in standardized test preparation, focusing on the GRE quantitative exam. We investigate how ChatGPT performs across various question types in the GRE quantitative domain, and how modifying question prompts impacts its accuracy.
arXiv Detail & Related papers (2023-09-25T20:25:29Z)
Can ChatGPT pass the Vietnamese National High School Graduation Examination? [0.0]
The study dataset included 30 essays in the literature test case and 1,700 multiple-choice questions designed for other subjects. ChatGPT was able to pass the examination with an average score of 6-7, demonstrating the technology's potential to revolutionize the educational landscape.
arXiv Detail & Related papers (2023-06-15T14:47:03Z)
Chatbots put to the test in math and logic problems: A preliminary comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard [68.8204255655161]
We use 30 questions that are clear, without any ambiguities, fully described with plain text only, and have a unique, well defined correct answer. The answers are recorded and discussed, highlighting their strengths and weaknesses. It was found that ChatGPT-4 outperforms ChatGPT-3.5 in both sets of questions.
arXiv Detail & Related papers (2023-05-30T11:18:05Z)
To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection. We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains. Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
Can AI Chatbots Pass the Fundamentals of Engineering (FE) and Principles and Practice of Engineering (PE) Structural Exams? [1.0554048699217669]
ChatGPT-4 and Bard, respectively, scored 70.9% and 39.2% in the FE exam and 46.2% and 41% in the PE exam. It is evident that the current version of ChatGPT-4 could potentially pass the FE exam.
arXiv Detail & Related papers (2023-03-31T15:37:17Z)
Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot. Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community. It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation. It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries. However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)
Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions. Our framework included a guiding robot and an interlocutor model that plays the role of humans. We examined our framework using three experimental setups and evaluate the guiding robot with four different metrics to demonstrated flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)
Investigation of Sentiment Controllable Chatbot [50.34061353512263]
In this paper, we investigate four models to scale or adjust the sentiment of the response. The models are a persona-based model, reinforcement learning, a plug and play model, and CycleGAN. We develop machine-evaluated metrics to estimate whether the responses are reasonable given the input.
arXiv Detail & Related papers (2020-07-11T16:04:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.