Chatbots put to the test in math and logic problems: A preliminary
comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard
- URL: http://arxiv.org/abs/2305.18618v1
- Date: Tue, 30 May 2023 11:18:05 GMT
- Title: Chatbots put to the test in math and logic problems: A preliminary
comparison and assessment of ChatGPT-3.5, ChatGPT-4, and Google Bard
- Authors: Vagelis Plevris, George Papazafeiropoulos, Alejandro Jiménez Rios
- Abstract summary: We use 30 questions that are clear, free of ambiguity, fully described in plain text only, and have a unique, well-defined correct answer.
The answers are recorded and discussed, highlighting their strengths and weaknesses.
It was found that ChatGPT-4 outperforms ChatGPT-3.5 in both sets of questions.
- Score: 68.8204255655161
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A comparison between three chatbots based on large language
models, namely ChatGPT-3.5, ChatGPT-4, and Google Bard, is presented,
focusing on their ability to give correct answers to mathematics and logic
problems. In particular, we check their ability to understand the problem
at hand, apply appropriate algorithms or methods for its solution, and
generate a coherent response and a correct answer. We use 30 questions
that are clear, free of ambiguity, fully described in plain text only, and
have a unique, well-defined correct answer. The questions are divided into
two sets of 15 each. The
questions of Set A are 15 "Original" problems that cannot be found online,
while Set B contains 15 "Published" problems that one can find online, usually
with their solution. Each question is posed three times to each chatbot. The
answers are recorded and discussed, highlighting their strengths and
weaknesses. It was found that for straightforward arithmetic, algebraic
expressions, or basic logic puzzles, the chatbots may provide accurate
solutions, although not in every attempt. However, for more complex
mathematical problems or advanced logic tasks, their answers, although
usually written in a "convincing" way, may not be reliable. Consistency is
also an issue: a chatbot will often provide conflicting answers when given
the same question more than once. A comparative quantitative evaluation of
the three chatbots is
made by scoring their final answers for correctness. ChatGPT-4 outperforms
ChatGPT-3.5 in both sets of questions. Bard comes third on the original
questions of Set A, behind the other two chatbots, while it has the best
performance (first place) on the published questions of Set B. This is
probably because Bard has direct access to the internet, in contrast to
the ChatGPT chatbots, which have no communication with the outside world.
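As a rough illustration of the scoring protocol described above (each of
the 30 questions is posed three times and only the correctness of the
final answer is scored), the following minimal Python sketch is offered.
It is hypothetical, not the authors' code: the ask() callback, the
exact-match comparison, and the equal weighting of attempts are all
assumptions.

    ATTEMPTS = 3  # each question is posed three times, as in the paper

    def score_chatbot(ask, questions):
        # ask(text) -> the chatbot's final answer (one callable per chatbot)
        # questions -> list of (text, correct_answer) pairs, e.g. Set A or Set B
        correct = 0
        for text, expected in questions:
            for _ in range(ATTEMPTS):
                if ask(text) == expected:  # only the final answer is scored
                    correct += 1
        return correct / (len(questions) * ATTEMPTS)  # fraction of correct attempts

In practice, grading free-form chatbot answers requires human judgment
rather than exact string matching; the sketch only captures the counting
scheme used to compare ChatGPT-3.5, ChatGPT-4, and Bard on Sets A and B.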
Related papers
- A Study on the Vulnerability of Test Questions against ChatGPT-based
Cheating [14.113742357609285]
ChatGPT can answer text prompts fairly accurately, even performing very well on postgraduate-level questions.
Many educators have found that their take-home or remote tests and exams are vulnerable to ChatGPT-based cheating.
arXiv Detail & Related papers (2024-02-21T23:51:06Z)
- Comparative Analysis of ChatGPT, GPT-4, and Microsoft Bing Chatbots for
GRE Test [0.0]
This research paper presents an analysis of how well three artificial intelligence chatbots, Bing, ChatGPT, and GPT-4, perform when answering questions from standardized tests.
A total of 137 questions covering different forms of quantitative reasoning and 157 questions in verbal categories were used to assess their capabilities.
arXiv Detail & Related papers (2023-11-26T05:27:35Z)
- Answering Ambiguous Questions with a Database of Questions, Answers, and
Revisions [95.92276099234344]
We present a new state-of-the-art for answering ambiguous questions that exploits a database of unambiguous questions generated from Wikipedia.
Our method improves performance by 15% on recall measures and 10% on measures which evaluate disambiguating questions from predicted outputs.
arXiv Detail & Related papers (2023-08-16T20:23:16Z)
- Is Stack Overflow Obsolete? An Empirical Study of the Characteristics of
ChatGPT Answers to Stack Overflow Questions [7.065853028825656]
We conducted the first in-depth analysis of ChatGPT answers to programming questions on Stack Overflow.
We examined the correctness, consistency, comprehensiveness, and conciseness of ChatGPT answers.
Our analysis shows that 52% of ChatGPT answers contain incorrect information and 77% are verbose.
arXiv Detail & Related papers (2023-08-04T13:23:20Z)
- ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of
Commonsense Problem in Large Language Models [49.52083248451775]
Large language models (LLMs) have made significant progress in NLP.
We specifically focus on ChatGPT, a widely used and easily accessible LLM.
We conduct a series of experiments on 11 datasets to evaluate ChatGPT's commonsense abilities.
arXiv Detail & Related papers (2023-03-29T03:05:43Z)
- Chatbots as Problem Solvers: Playing Twenty Questions with Role
Reversals [0.0]
New chat AI applications like ChatGPT offer an advanced understanding of question context and memory across multi-step tasks.
This paper proposes a multi-role and multi-step challenge, where ChatGPT plays the classic twenty-questions game but innovatively switches roles from the questioner to the answerer.
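The role reversal can be pictured with a short, hypothetical Python
sketch; the chat() and respond() callables stand in for a chat-model API
and the human (or simulated) partner, and the prompts are assumptions
rather than the paper's exact protocol.

    def play_twenty_questions(chat, respond, model_asks):
        # chat(message) -> the model's reply; respond(message) -> the human side.
        # model_asks selects the role: True puts the model in the questioner seat.
        if model_asks:
            message = chat("Let's play twenty questions. I am thinking of an "
                           "object; ask me yes/no questions to identify it.")
        else:
            message = chat("Let's play twenty questions. Think of an object "
                           "and answer my yes/no questions with yes or no.")
        for _ in range(20):  # alternate turns for up to twenty rounds
            message = chat(respond(message))
        return message  # final guess or final answer, depending on the role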
arXiv Detail & Related papers (2023-01-01T03:04:04Z)
- Implementing a Chatbot Solution for Learning Management System [0.0]
One of the main problems that chatbots face today is mimicking human language.
An extreme programming methodology was chosen to integrate ChatterBot, PySide2, web scraping, and Tampermonkey into Blackboard.
We showed the plausibility of integrating an AI bot in an educational setting.
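For reference, a minimal ChatterBot setup of the kind the paper builds on
might look like the sketch below (requires the chatterbot package). The
bot name, training corpus, and example query are assumptions; the PySide2
interface, web scraping, and Tampermonkey glue for Blackboard described in
the paper are omitted.

    from chatterbot import ChatBot
    from chatterbot.trainers import ChatterBotCorpusTrainer

    bot = ChatBot("CourseAssistant")            # hypothetical bot name
    trainer = ChatterBotCorpusTrainer(bot)
    trainer.train("chatterbot.corpus.english")  # ChatterBot's built-in corpus

    print(bot.get_response("When is the next assignment due?"))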
arXiv Detail & Related papers (2022-06-27T11:04:42Z)
- QAConv: Question Answering on Informative Conversations [85.2923607672282]
We focus on informative conversations including business emails, panel discussions, and work channels.
In total, we collect 34,204 QA pairs, including span-based, free-form, and unanswerable questions.
arXiv Detail & Related papers (2021-05-14T15:53:05Z)
- Put Chatbot into Its Interlocutor's Shoes: New Framework to Learn
Chatbot Responding with Intention [55.77218465471519]
This paper proposes an innovative framework to train chatbots to possess human-like intentions.
Our framework includes a guiding robot and an interlocutor model that plays the role of a human.
We examined our framework using three experimental setups and evaluated the guiding robot with four different metrics to demonstrate its flexibility and performance advantages.
arXiv Detail & Related papers (2021-03-30T15:24:37Z)
- ConvAI3: Generating Clarifying Questions for Open-Domain Dialogue
Systems (ClariQ) [64.60303062063663]
This document presents a detailed description of the challenge on clarifying questions for dialogue systems (ClariQ).
The challenge is organized as part of the Conversational AI challenge series (ConvAI3) at Search Oriented Conversational AI (SCAI) EMNLP workshop in 2020.
arXiv Detail & Related papers (2020-09-23T19:48:02Z)