Performance of the Pre-Trained Large Language Model GPT-4 on Automated
Short Answer Grading
- URL: http://arxiv.org/abs/2309.09338v1
- Date: Sun, 17 Sep 2023 18:04:34 GMT
- Title: Performance of the Pre-Trained Large Language Model GPT-4 on Automated
Short Answer Grading
- Authors: Gerd Kortemeyer
- Abstract summary: We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle.
We found that the performance of the pre-trained general-purpose GPT-4 LLM is comparable to hand-engineered models, but worse than pre-trained LLMs that had specialized training.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated Short Answer Grading (ASAG) has been an active area of
machine-learning research for over a decade. It promises to let educators grade
and give feedback on free-form responses in large-enrollment courses despite
the limited availability of human graders. Over the years, carefully trained
models have achieved increasingly higher levels of performance. More recently,
pre-trained Large Language Models (LLMs) emerged as a commodity, and an
intriguing question is how a general-purpose tool without additional training
compares to specialized models. We studied the performance of GPT-4 on the
standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle, where in
addition to the standard task of grading the alignment of the student answer
with a reference answer, we also investigated withholding the reference answer.
We found that overall, the performance of the pre-trained general-purpose GPT-4
LLM is comparable to hand-engineered models, but worse than pre-trained LLMs
that had specialized training.
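The grading setup described in the abstract, presenting a question, a student answer, and (optionally) a reference answer, then asking for a 2-way or 3-way label, can be sketched as a simple prompt builder. This is an illustrative sketch only, not the paper's actual prompt; the label names and wording are assumptions.

```python
# Illustrative sketch (not the paper's actual prompt): building a zero-shot
# ASAG prompt for a chat model. The reference answer is optional so the
# "withheld reference" condition from the abstract can also be exercised.

def build_grading_prompt(question, student_answer, reference_answer=None,
                         labels=("correct", "incorrect")):
    """Return a grading prompt string for a 2-way or 3-way label scheme."""
    parts = [
        "You are grading a short free-form answer.",
        f"Question: {question}",
    ]
    # Withholding the reference answer is as simple as passing None here.
    if reference_answer is not None:
        parts.append(f"Reference answer: {reference_answer}")
    parts.append(f"Student answer: {student_answer}")
    parts.append("Respond with exactly one label: " + ", ".join(labels) + ".")
    return "\n".join(parts)


if __name__ == "__main__":
    # 3-way grading with a reference answer (SciEntsBank-style labels assumed)
    prompt = build_grading_prompt(
        "Why does a metal spoon feel colder than a wooden one?",
        "Because metal conducts heat away from the hand faster.",
        reference_answer="Metal has higher thermal conductivity than wood.",
        labels=("correct", "contradictory", "incorrect"),
    )
    print(prompt)
```

The resulting string would be sent to the model as a single user message; the choice of label set switches between the 2-way and 3-way benchmark variants.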
Related papers
- RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs [60.38044044203333]
Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG).
We propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG.
For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-04-09, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks.
arXiv Detail & Related papers (2024-07-02T17:59:17Z) - InternLM2 Technical Report [159.70692271378581]
This paper introduces InternLM2, an open-source Large Language Model (LLM) that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks.
The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types.
InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-03-26T00:53:24Z) - LLMs Still Can't Avoid Instanceof: An Investigation Into GPT-3.5, GPT-4
and Bard's Capacity to Handle Object-Oriented Programming Assignments [0.0]
Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments.
In this study, we experimented with three prominent LLMs to solve real-world OOP exercises used in educational settings.
The findings revealed that while the models frequently produced largely working solutions to the exercises, they often overlooked the best practices of OOP.
arXiv Detail & Related papers (2024-03-10T16:40:05Z) - Efficient Classification of Student Help Requests in Programming Courses
Using Large Language Models [2.5949084781328744]
This study evaluates the performance of the GPT-3.5 and GPT-4 models for classifying help requests from students in an introductory programming class.
Fine-tuning the GPT-3.5 model improved its performance enough to approximate the accuracy and per-category consistency observed between two human raters.
arXiv Detail & Related papers (2023-10-31T00:56:33Z) - LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent to or strictly preferred over GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z) - Is ChatGPT Good at Search? Investigating Large Language Models as
Re-Ranking Agents [56.104476412839944]
Large Language Models (LLMs) have demonstrated remarkable zero-shot generalization across various language-related tasks.
This paper investigates generative LLMs for relevance ranking in Information Retrieval (IR).
To address concerns about data contamination of LLMs, we collect a new test set called NovelEval.
To improve efficiency in real-world applications, we delve into the potential for distilling the ranking capabilities of ChatGPT into small specialized models.
arXiv Detail & Related papers (2023-04-19T10:16:03Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark
for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.