ArguGPT: evaluating, understanding and identifying argumentative essays generated by GPT models
- URL: http://arxiv.org/abs/2304.07666v2
- Date: Sat, 23 Sep 2023 14:05:58 GMT
- Authors: Yikang Liu, Ziyin Zhang, Wanyang Zhang, Shisen Yue, Xiaojing Zhao, Xinyuan Cheng, Yiwen Zhang, Hai Hu
- Abstract summary: We first present ArguGPT, a balanced corpus of 4,038 argumentative essays generated by 7 GPT models.
We then hire English instructors to distinguish machine essays from human ones.
Results show that when first exposed to machine-generated essays, the instructors detect them with only 61% accuracy.
- Score: 9.483206389157509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI-generated content (AIGC) presents a considerable challenge to
educators around the world. Instructors need to be able to detect text
generated by large language models, either with the naked eye or with the
help of tools. There is also a growing need to understand the lexical,
syntactic and stylistic features of AIGC. To address these challenges in
English language teaching, we first present ArguGPT, a balanced corpus of
4,038 argumentative essays generated by 7 GPT models in response to essay
prompts from three sources: (1) in-class or homework exercises, (2) TOEFL and
(3) GRE writing tasks. Machine-generated texts are paired with a roughly
equal number of human-written essays at three score levels, matched by essay
prompt. We then hire English instructors to distinguish machine essays from
human ones. Results show that when first exposed to machine-generated essays,
the instructors detect them with only 61% accuracy, a figure that rises to
67% after one round of minimal self-training. Next, we perform linguistic
analyses of these essays, which show that machines produce sentences with
more complex syntactic structures while human essays tend to be lexically
more complex. Finally, we test existing AIGC detectors and build our own
detectors using SVMs and RoBERTa. Results suggest that a RoBERTa model
fine-tuned on the training set of ArguGPT achieves above 90% accuracy in both
essay-level and sentence-level classification. To the best of our knowledge,
this is the first comprehensive analysis of argumentative essays produced by
generative large language models. Machine-authored essays in ArguGPT and our
models will be made publicly available at https://github.com/huhailinguist/ArguGPT
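To make the abstract's contrast concrete, here is a toy sketch, in Python with spaCy, of the two kinds of complexity being compared: lexical diversity (measured here by type-token ratio) and syntactic complexity (measured here by mean dependency-tree depth). These particular metrics and the sample text are illustrative assumptions, not the feature set the paper actually reports.

```python
# Toy measures of the two notions of complexity contrasted in the abstract:
# lexical diversity (type-token ratio) and syntactic complexity (mean
# dependency-tree depth). Stand-in metrics, not the paper's feature set.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def type_token_ratio(doc):
    # Fraction of distinct word types among all alphabetic tokens.
    words = [t.text.lower() for t in doc if t.is_alpha]
    return len(set(words)) / len(words) if words else 0.0

def depth(token):
    # Depth of the dependency subtree rooted at `token`.
    return 1 + max((depth(child) for child in token.children), default=0)

def mean_parse_depth(doc):
    # Average depth of each sentence's dependency tree.
    depths = [depth(sent.root) for sent in doc.sents]
    return sum(depths) / len(depths) if depths else 0.0

essay = nlp("Education should be free for everyone. Societies that invest "
            "in universal schooling, which broadens opportunity, prosper.")
print(f"TTR: {type_token_ratio(essay):.2f}, "
      f"mean parse depth: {mean_parse_depth(essay):.2f}")
```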
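The strongest detector described in the abstract is a fine-tuned RoBERTa classifier. Below is a minimal sketch of such an essay-level detector using the Hugging Face Trainer; the CSV file names, column names, and hyperparameters are assumptions for illustration, not the authors' released configuration (their code is in the linked repository).

```python
# Minimal sketch of an essay-level detector in the spirit of the abstract:
# RoBERTa fine-tuned as a binary human-vs-machine classifier. File names,
# column names, and hyperparameters are illustrative assumptions.
import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # label 0 = human, 1 = machine

def tokenize(batch):
    # Truncate to RoBERTa's 512-token window; very long essays lose their tails.
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Hypothetical CSVs with "text" and "label" columns built from ArguGPT essays.
train = Dataset.from_pandas(pd.read_csv("argugpt_train.csv")).map(tokenize, batched=True)
dev = Dataset.from_pandas(pd.read_csv("argugpt_dev.csv")).map(tokenize, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": (np.argmax(logits, axis=-1) == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
    eval_dataset=dev,
    data_collator=DataCollatorWithPadding(tokenizer),  # pads each batch dynamically
    compute_metrics=accuracy,
)
trainer.train()
print(trainer.evaluate())  # reports accuracy on the held-out essays
```

The SVM detectors the abstract also mentions would follow the same train/evaluate shape, with hand-crafted linguistic features and a linear kernel in scikit-learn instead of a fine-tuned encoder.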
Related papers
- A School Student Essay Corpus for Analyzing Interactions of Argumentative Structure and Quality [12.187586364960758]
We present a German corpus of 1,320 essays from school students of two age groups.
Each essay has been manually annotated for argumentative structure and quality on multiple levels of granularity.
We propose baseline approaches to argument mining and essay scoring, and we analyze interactions between both tasks.
arXiv Detail & Related papers (2024-04-03T07:31:53Z) - FacTool: Factuality Detection in Generative AI -- A Tool Augmented
Framework for Multi-Task and Multi-Domain Scenarios [87.12753459582116]
A wider range of tasks now faces an increasing risk of containing factual errors when handled by generative models.
We propose FacTool, a task and domain agnostic framework for detecting factual errors of texts generated by large language models.
arXiv Detail & Related papers (2023-07-25T14:20:51Z) - AI, write an essay for me: A large-scale comparison of human-written
versus ChatGPT-generated essays [66.36541161082856]
ChatGPT and similar generative AI models have attracted hundreds of millions of users.
This study compares human-written versus ChatGPT-generated argumentative student essays.
arXiv Detail & Related papers (2023-04-24T12:58:28Z) - On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approaches human-like quality, the sample size needed for reliable detection increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z) - Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue [92.01165203498299]
Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange.
This paper argues that imitation learning (IL) and related low-level metrics are actually misleading and do not align with the goals of embodied dialogue research.
arXiv Detail & Related papers (2022-10-10T05:51:40Z) - How Large Language Models are Transforming Machine-Paraphrased
Plagiarism [3.8768839735240737]
This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia.
We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software.
Human experts rate the quality of paraphrases generated by GPT-3 as highly as that of the original texts.
arXiv Detail & Related papers (2022-10-07T14:08:57Z) - Automatic Summarization of Russian Texts: Comparison of Extractive and
Abstractive Methods [0.0]
arXiv Detail & Related papers (2022-06-18T17:28:04Z) - Argumentative Text Generation in Economic Domain [0.11470070927586015]
A key problem for argumentative text generation in Russian is the lack of annotated argumentation corpora.
In this paper, we use translated versions of the Argumentative Microtext, Persuasive Essays and UKP Sentential corpora to fine-tune a RuBERT model.
The results show that this approach improves the accuracy of argument generation by more than 20 percentage points compared to the original ruGPT-3 model.
arXiv Detail & Related papers (2022-06-18T17:22:06Z) - RuArg-2022: Argument Mining Evaluation [69.87149207721035]
This paper reports on the first shared-task competition for argumentation analysis systems dealing with Russian-language texts.
A corpus containing 9,550 sentences (comments on social media posts) on three topics related to the COVID-19 pandemic was prepared.
The system that won first place in both tasks used the NLI (Natural Language Inference) variant of the BERT architecture.
arXiv Detail & Related papers (2022-06-18T17:13:37Z) - My Teacher Thinks The World Is Flat! Interpreting Automatic Essay
Scoring Mechanism [71.34160809068996]
Recent work shows that automated scoring systems are vulnerable even to common-sense adversarial samples.
We utilize recent advances in interpretability to find the extent to which features such as coherence, content and relevance are important for automated scoring mechanisms.
We also find that since the models are not semantically grounded with world-knowledge and common sense, adding false facts such as "the world is flat" actually increases the score instead of decreasing it.
arXiv Detail & Related papers (2020-12-27T06:19:20Z)