Performance of the Pre-Trained Large Language Model GPT-4 on Automated
Short Answer Grading
- URL: http://arxiv.org/abs/2309.09338v1
- Date: Sun, 17 Sep 2023 18:04:34 GMT
- Title: Performance of the Pre-Trained Large Language Model GPT-4 on Automated
Short Answer Grading
- Authors: Gerd Kortemeyer
- Abstract summary: We studied the performance of GPT-4 on the standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle.
We found that the performance of the pre-trained general-purpose GPT-4 LLM is comparable to hand-engineered models, but worse than pre-trained LLMs that had specialized training.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated Short Answer Grading (ASAG) has been an active area of
machine-learning research for over a decade. It promises to let educators grade
and give feedback on free-form responses in large-enrollment courses in spite
of limited availability of human graders. Over the years, carefully trained
models have achieved increasingly higher levels of performance. More recently,
pre-trained Large Language Models (LLMs) emerged as a commodity, and an
intriguing question is how a general-purpose tool without additional training
compares to specialized models. We studied the performance of GPT-4 on the
standard benchmark 2-way and 3-way datasets SciEntsBank and Beetle, where in
addition to the standard task of grading the alignment of the student answer
with a reference answer, we also investigated withholding the reference answer.
We found that overall, the performance of the pre-trained general-purpose GPT-4
LLM is comparable to hand-engineered models, but worse than pre-trained LLMs
that had specialized training.
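To make the grading task concrete, the following is a minimal sketch of a 2-way ASAG query to GPT-4. The prompt wording, the grade_answer helper, and the model identifier are illustrative assumptions, not taken from the paper; the call shape follows the OpenAI Python client (v1+).
```python
# Minimal sketch of 2-way Automated Short Answer Grading with GPT-4.
# Assumptions (not from the paper): the OpenAI Python client (>=1.0),
# the "gpt-4" model identifier, and this particular prompt wording.
from typing import Optional
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_answer(question: str, student_answer: str,
                 reference_answer: Optional[str] = None) -> str:
    """Return 'correct' or 'incorrect' for a student answer.

    Passing reference_answer=None mimics the paper's variant task of
    withholding the reference answer from the model.
    """
    prompt = f"Question: {question}\n"
    if reference_answer is not None:
        prompt += f"Reference answer: {reference_answer}\n"
    prompt += (
        f"Student answer: {student_answer}\n"
        "Grade the student answer with exactly one word: "
        "'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip().lower()

# Example: grading with the reference answer provided.
label = grade_answer(
    question="Why does a metal spoon feel colder than a wooden one?",
    student_answer="Metal conducts heat away from the hand faster.",
    reference_answer="Metal has a higher thermal conductivity than wood.",
)
print(label)
```
For the 3-way variants of SciEntsBank and Beetle, the instruction would instead ask for one of three labels (correct, contradictory, incorrect).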
Related papers
- Self-Judge: Selective Instruction Following with Alignment Self-Evaluation [27.69410513313001]
We study the task of selective instruction following, whereby the system declines to execute instructions if the anticipated response quality is low.
We introduce Self-J, a novel self-training framework for developing judge models without needing human-annotated quality scores.
arXiv Detail & Related papers (2024-09-02T04:14:13Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to improve evaluators without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z) - RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs [60.38044044203333]
Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG).
We propose a novel instruction fine-tuning framework, RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG; a minimal sketch of this rank-then-generate flow appears after this list.
For generation, we compare our model with many strong baselines, including GPT-4-0613, GPT-4-turbo-2024-0409, and ChatQA-1.5, an open-source model with state-of-the-art performance on RAG benchmarks.
arXiv Detail & Related papers (2024-07-02T17:59:17Z) - InternLM2 Technical Report [159.70692271378581]
This paper introduces InternLM2, an open-source Large Language Model (LLM) that outperforms its predecessors in comprehensive evaluations across 6 dimensions and 30 benchmarks.
The pre-training process of InternLM2 is meticulously detailed, highlighting the preparation of diverse data types.
InternLM2 efficiently captures long-term dependencies, initially trained on 4k tokens before advancing to 32k tokens in pre-training and fine-tuning stages.
arXiv Detail & Related papers (2024-03-26T00:53:24Z) - Efficient Classification of Student Help Requests in Programming Courses
Using Large Language Models [2.5949084781328744]
This study evaluates the performance of the GPT-3.5 and GPT-4 models for classifying help requests from students in an introductory programming class.
Fine-tuning the GPT-3.5 model improved its performance to such an extent that it approximated the accuracy and consistency across categories observed between two human raters.
arXiv Detail & Related papers (2023-10-31T00:56:33Z) - LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z) - SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark
for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
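As referenced in the RankRAG entry above, the core idea is a single LLM serving two roles at inference time: ranking retrieved contexts and generating the answer. The sketch below illustrates that flow under assumed, generic llm_score and llm_generate interfaces; it is not the authors' implementation.
```python
# Sketch of the dual-purpose inference flow described in the RankRAG
# entry above. The rank-then-generate structure follows the abstract;
# llm_score and llm_generate are hypothetical stand-ins for one
# instruction-tuned LLM serving both roles.
from typing import Callable, List

def rank_rag_answer(question: str,
                    retrieved_contexts: List[str],
                    llm_score: Callable[[str, str], float],
                    llm_generate: Callable[[str, List[str]], str],
                    top_k: int = 5) -> str:
    # Step 1: the same LLM scores each retrieved context for relevance,
    # rather than trusting the retriever's original ordering.
    ranked = sorted(retrieved_contexts,
                    key=lambda ctx: llm_score(question, ctx),
                    reverse=True)
    # Step 2: generate the answer conditioned only on the top-ranked
    # contexts.
    return llm_generate(question, ranked[:top_k])
```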
This list is automatically generated from the titles and abstracts of the papers on this site.