Learning gain differences between ChatGPT and human tutor generated
algebra hints
- URL: http://arxiv.org/abs/2302.06871v1
- Date: Tue, 14 Feb 2023 07:20:48 GMT
- Title: Learning gain differences between ChatGPT and human tutor generated
algebra hints
- Authors: Zachary A. Pardos, Shreya Bhandari
- Abstract summary: We conduct the first learning gain evaluation of ChatGPT by comparing the efficacy of its hints with hints authored by human tutors.
We find that 70% of hints produced by ChatGPT passed our manual quality checks and that both human and ChatGPT conditions produced positive learning gains.
- Score: 4.438259529250529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), such as ChatGPT, are quickly advancing AI to
the frontiers of practical consumer use and leading industries to re-evaluate
how they allocate resources for content production. Authoring of open
educational resources and hint content within adaptive tutoring systems is
labor intensive. Should LLMs like ChatGPT produce educational content on par
with human-authored content, the implications would be significant for further
scaling of computer tutoring system approaches. In this paper, we conduct the
first learning gain evaluation of ChatGPT by comparing the efficacy of its
hints with hints authored by human tutors with 77 participants across two
algebra topic areas, Elementary Algebra and Intermediate Algebra. We find that
70% of hints produced by ChatGPT passed our manual quality checks and that both
human and ChatGPT conditions produced positive learning gains. However, gains
were only statistically significant for human tutor created hints. Learning
gains from human-created hints were substantially and statistically
significantly higher than ChatGPT hints in both topic areas, though ChatGPT
participants in the Intermediate Algebra experiment were near ceiling and not
even with the control at pre-test. We discuss the limitations of our study and
suggest several future directions for the field. Problem and hint content used
in the experiment is provided for replicability.
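The abstract reports pre- to post-test learning gains and significance tests but does not spell out the computation; the following is a minimal sketch of how a pre/post learning-gain analysis of this kind is often run. The scores, the post-minus-pre gain definition, and the paired t-test are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of a pre/post learning-gain analysis (not the paper's code).
# The scores, the post-minus-pre gain, and the paired t-test are assumptions.
import numpy as np
from scipy import stats

def learning_gain(pre: np.ndarray, post: np.ndarray):
    """Return the mean gain (post - pre) and a paired t-test within a condition."""
    gains = post - pre
    t_stat, p_value = stats.ttest_rel(post, pre)
    return gains.mean(), t_stat, p_value

# Hypothetical fraction-correct scores for one hint condition
pre = np.array([0.40, 0.55, 0.30, 0.60, 0.45])
post = np.array([0.55, 0.70, 0.35, 0.80, 0.50])

mean_gain, t_stat, p_value = learning_gain(pre, post)
print(f"mean gain = {mean_gain:.3f}, t = {t_stat:.2f}, p = {p_value:.4f}")
```

Comparing the human-tutor and ChatGPT conditions would then typically contrast the two conditions' gain distributions (e.g. with an independent-samples test), again as an assumption about common practice rather than the paper's exact method.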
Related papers
- Using ChatGPT for Science Learning: A Study on Pre-service Teachers' Lesson Planning [0.7416846035207727]
This study analyzed lesson plans developed by 29 pre-service elementary teachers from a Korean university.
14 types of teaching and learning methods/strategies were identified in the lesson plans.
The study identified both appropriate and inappropriate use cases of ChatGPT in lesson planning.
arXiv Detail & Related papers (2024-01-18T22:52:04Z)
- Chatbots Are Not Reliable Text Annotators [0.0]
ChatGPT is a closed-source product which has major drawbacks with regards to transparency, cost, and data protection.
Recent advances in open-source (OS) large language models (LLMs) offer alternatives which remedy these challenges.
arXiv Detail & Related papers (2023-11-09T22:28:14Z)
- ChatGPT is not a pocket calculator -- Problems of AI-chatbots for teaching Geography [0.11049608786515837]
A common assumption is that ChatGPT enables fraud because it threatens the validity of assessments.
Based on a preliminary survey on ChatGPT's quality in answering questions in Geography and GIScience, we demonstrate that this assumption might be fairly naive.
arXiv Detail & Related papers (2023-07-03T15:35:21Z)
- Transformative Effects of ChatGPT on Modern Education: Emerging Era of AI Chatbots [36.760677949631514]
ChatGPT was released to provide coherent and useful replies based on analysis of large volumes of data.
Our preliminary evaluation concludes that ChatGPT performed differently across subject areas, including finance, coding, and maths.
There are clear drawbacks in its use, such as the possibility of producing inaccurate or false data.
Academic regulations and evaluation practices need to be updated, should ChatGPT be used as a tool in education.
arXiv Detail & Related papers (2023-05-25T17:35:57Z)
- ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning [70.57126720079971]
Large language models (LLMs) have emerged as one of the most important breakthroughs in natural language processing (NLP).
This paper evaluates ChatGPT on 7 different tasks, covering 37 diverse languages with high, medium, low, and extremely low resources.
Our extensive experimental results demonstrate that ChatGPT performs worse than previous models across different NLP tasks and languages.
arXiv Detail & Related papers (2023-04-12T05:08:52Z)
- ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking about [15.19126287569545]
This research examines the responses generated by ChatGPT from different Conversational QA corpora.
The study employed BERT similarity scores to compare these responses with correct answers and obtain Natural Language Inference (NLI) labels; a rough sketch of this kind of similarity scoring is given after this list.
The study identified instances where ChatGPT provided incorrect answers to questions, providing insights into areas where the model may be prone to error.
arXiv Detail & Related papers (2023-04-06T18:42:47Z)
- To ChatGPT, or not to ChatGPT: That is the question! [78.407861566006]
This study provides a comprehensive and contemporary assessment of the most recent techniques in ChatGPT detection.
We have curated a benchmark dataset consisting of prompts from ChatGPT and humans, including diverse questions from medical, open Q&A, and finance domains.
Our evaluation results demonstrate that none of the existing methods can effectively detect ChatGPT-generated content.
arXiv Detail & Related papers (2023-04-04T03:04:28Z)
- Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We evaluate ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark, and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z)
- A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity [79.12003701981092]
We carry out an extensive technical evaluation of ChatGPT using 23 data sets covering 8 different common NLP application tasks.
We evaluate the multitask, multilingual and multi-modal aspects of ChatGPT based on these data sets and a newly designed multimodal dataset.
ChatGPT is 63.41% accurate on average across 10 different reasoning categories spanning logical, non-textual, and commonsense reasoning.
arXiv Detail & Related papers (2023-02-08T12:35:34Z)
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver? [113.22611481694825]
Large language models (LLMs) have demonstrated the ability to perform a variety of natural language processing (NLP) tasks zero-shot.
Recently, the debut of ChatGPT has drawn a great deal of attention from the natural language processing (NLP) community.
It is not yet known whether ChatGPT can serve as a generalist model that can perform many NLP tasks zero-shot.
arXiv Detail & Related papers (2023-02-08T09:44:51Z)
- A Categorical Archive of ChatGPT Failures [47.64219291655723]
ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation.
It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries.
However, a comprehensive analysis of ChatGPT's failures is lacking, which is the focus of this study.
arXiv Detail & Related papers (2023-02-06T04:21:59Z)
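As a rough illustration of the similarity-scoring step mentioned in the ChatGPT-Crawler summary above, the sketch below compares model responses with reference answers via cosine similarity over BERT-based sentence embeddings. The sentence-transformers model name and the example strings are assumptions for illustration; the paper's exact pipeline, including its NLI labeling step, is not reproduced here.

```python
# Hedged sketch: score responses against reference answers with cosine
# similarity over BERT-based sentence embeddings (model and strings assumed).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

responses = ["The capital of France is Paris."]
references = ["Paris is the capital city of France."]

emb_resp = model.encode(responses, convert_to_tensor=True)
emb_ref = model.encode(references, convert_to_tensor=True)

# Cosine similarity between each response and its corresponding reference
scores = util.cos_sim(emb_resp, emb_ref).diagonal()
for response, score in zip(responses, scores):
    print(f"{score.item():.3f}  {response}")
```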