Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education
- URL: http://arxiv.org/abs/2407.17022v1
- Date: Wed, 24 Jul 2024 06:02:57 GMT
- Title: Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education
- Authors: Seungyoon Kim, Seungone Kim
- Abstract summary: Large language model (LLM)-based evaluation pipelines have demonstrated their capability to robustly evaluate machine-generated text.
We investigate whether LLMs can effectively assess human-written text for educational purposes.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM)-based evaluation pipelines have demonstrated their capability to robustly evaluate machine-generated text. Extending this methodology to assess human-written text could significantly benefit educational settings by providing direct feedback to enhance writing skills, although this application is not straightforward. In this paper, we investigate whether LLMs can effectively assess human-written text for educational purposes. We collected 100 texts from 32 Korean students across 15 types of writing and employed GPT-4-Turbo to evaluate them using grammaticality, fluency, coherence, consistency, and relevance as criteria. Our analyses indicate that LLM evaluators can reliably assess grammaticality and fluency, as well as more objective types of writing, though they struggle with other criteria and types of writing. We publicly release our dataset and feedback.
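The rubric-based evaluation pipeline described in the abstract can be sketched as follows. The prompt wording, the JSON response schema, and the helper names are illustrative assumptions; the paper's actual prompts and GPT-4-Turbo outputs are in its released dataset, and no API call is made in this sketch.

```python
import json

# The five evaluation criteria named in the abstract.
CRITERIA = ["grammaticality", "fluency", "coherence", "consistency", "relevance"]

def build_prompt(student_text: str) -> str:
    """Assemble a rubric-style evaluation prompt (wording is a guess,
    not the paper's actual prompt)."""
    rubric = ", ".join(CRITERIA)
    return (
        "Evaluate the following student writing on these criteria: "
        f"{rubric}. Return a JSON object mapping each criterion to a "
        "score from 1 to 5, plus a 'feedback' string.\n\n"
        f"Text:\n{student_text}"
    )

def parse_scores(llm_response: str) -> dict:
    """Parse the model's JSON reply, keeping only the known criteria."""
    raw = json.loads(llm_response)
    return {c: int(raw[c]) for c in CRITERIA if c in raw}

# Example with a mocked model reply instead of a live LLM call.
mock_reply = (
    '{"grammaticality": 4, "fluency": 5, "coherence": 3, '
    '"consistency": 4, "relevance": 4, '
    '"feedback": "Clear but loosely organized."}'
)
scores = parse_scores(mock_reply)
print(scores)
```

In practice `build_prompt`'s output would be sent to the evaluator model and the reply fed to `parse_scores`; keeping the two steps separate makes the parsing logic testable without network access.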
Related papers
- Evaluating AI-Generated Essays with GRE Analytical Writing Assessment [15.993966092824335]
This study examines essays generated by ten leading LLMs for the analytical writing assessment of the Graduate Record Exam (GRE).
We assessed these essays using both human raters and the e-rater automated scoring engine as used in the GRE scoring pipeline.
The top-performing models, Gemini and GPT-4o, received average scores of 4.78 and 4.67, respectively.
arXiv Detail & Related papers (2024-10-22T21:30:58Z) - Exploring LLM Prompting Strategies for Joint Essay Scoring and Feedback Generation [13.854903594424876]
Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text.
This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback.
Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback.
arXiv Detail & Related papers (2024-04-24T12:48:06Z) - Navigating the Path of Writing: Outline-guided Text Generation with Large Language Models [8.920436030483872]
We propose Writing Path, a framework that uses explicit outlines to guide Large Language Models (LLMs) in generating user-aligned text.
Our approach draws inspiration from structured writing planning and reasoning paths, focusing on capturing and reflecting user intentions throughout the writing process.
arXiv Detail & Related papers (2024-04-22T06:57:43Z) - Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models [1.565361244756411]
This paper explores how large language models (LLMs) can be used to generate and evaluate reading comprehension items.
We developed a protocol for human and automatic evaluation, including a metric we call text informativity.
Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2.
arXiv Detail & Related papers (2024-04-11T13:11:21Z) - From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in LLMs-based Applications [26.857056013032263]
Evaluating large language models (LLMs) is fundamental, particularly in the context of practical applications.
Our study shifts the focus from model-centered to human-centered evaluation in the context of AI-powered writing assistance applications.
arXiv Detail & Related papers (2024-04-10T15:46:08Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face the challenges on generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our instruction-style question about the quality of the generated text into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
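The decompose-then-recompose procedure above can be roughly illustrated as follows. The sentence splitter, subquestion template, and yes-fraction aggregation are assumptions for illustration; DecompEval's actual recomposition feeds the subquestions and PLM-generated answers back to the model as evidence rather than averaging them.

```python
import re

def decompose(text: str, criterion: str = "coherent") -> list:
    """Split the generated text into sentences and form one
    instruction-style subquestion per sentence (template is illustrative)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [
        f'Is the sentence "{s}" {criterion} with the rest of the text? '
        "Answer Yes or No."
        for s in sentences
    ]

def recompose(answers: list) -> float:
    """Aggregate per-sentence Yes/No answers into a single score as the
    fraction of 'Yes' (a simplification of the paper's evidence-based
    recomposition)."""
    yes = sum(1 for a in answers if a.strip().lower().startswith("yes"))
    return yes / len(answers) if answers else 0.0

subqs = decompose("The plot is clear. The ending contradicts it.")
mock_answers = ["Yes", "No"]  # stand-ins for PLM-generated answers
print(len(subqs), recompose(mock_answers))
```

Answering one subquestion per sentence is what makes the resulting metric interpretable: each low-scoring sentence can be pointed to directly.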
arXiv Detail & Related papers (2023-07-13T16:16:51Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided.
We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z) - Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, like BLEU/ROUGE, may not be able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.