Argument Rarity-based Originality Assessment for AI-Assisted Writing
- URL: http://arxiv.org/abs/2602.01560v1
- Date: Mon, 02 Feb 2026 02:54:29 GMT
- Title: Argument Rarity-based Originality Assessment for AI-Assisted Writing
- Authors: Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka, Go Maeda, Kentaro Tsuji,
- Abstract summary: This study proposes Argument Rarity-based Originality Assessment (AROA), a framework for automatically evaluating argumentative originality in student essays. AROA defines originality as rarity within a reference corpus and evaluates it through four complementary components: structural rarity, claim rarity, evidence rarity, and cognitive depth.
- Score: 0.09786690381850356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) have become capable of effortlessly generating high-quality text, traditional quality-focused writing assessment is losing its significance. If the essential goal of education is to foster critical thinking and original perspectives, assessment must also shift its paradigm from quality to originality. This study proposes Argument Rarity-based Originality Assessment (AROA), a framework for automatically evaluating argumentative originality in student essays. AROA defines originality as rarity within a reference corpus and evaluates it through four complementary components: structural rarity, claim rarity, evidence rarity, and cognitive depth. The framework quantifies the rarity of each component using density estimation and integrates them with a quality adjustment mechanism, thereby treating quality and originality as independent evaluation axes. Experiments using human essays and AI-generated essays revealed a strong negative correlation between quality and claim rarity, demonstrating a quality-originality trade-off where higher-quality texts tend to rely on typical claim patterns. Furthermore, while AI essays achieved comparable levels of structural complexity to human essays, their claim rarity was substantially lower than that of humans, indicating that LLMs can reproduce the form of argumentation but have limitations in the originality of content.
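Since AROA quantifies rarity through density estimation over a reference corpus, a minimal sketch of that idea follows, assuming sentence-embedding features and a Gaussian KDE; the component weights, bandwidth, and multiplicative quality damping are illustrative assumptions, not the paper's exact mechanism.
```python
# A minimal sketch of rarity-as-density scoring, assuming sentence
# embeddings as the feature space. The paper's exact features,
# bandwidth, and quality-adjustment rule are not given here; the
# component weights and multiplicative damping below are illustrative.
import numpy as np
from sklearn.neighbors import KernelDensity

def rarity_scores(reference_emb: np.ndarray,
                  essay_emb: np.ndarray,
                  bandwidth: float = 0.5) -> np.ndarray:
    """Score essays by improbability under a KDE fit to the
    reference corpus: lower density means higher rarity."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(reference_emb)
    return -kde.score_samples(essay_emb)  # negative log-density

def aroa_like_score(structural, claim, evidence, depth, quality,
                    weights=(0.25, 0.25, 0.25, 0.25)):
    """Combine the four rarity components, then damp by a quality term
    so low-quality noise is not rewarded as originality (hypothetical)."""
    rarity = float(np.dot(weights, [structural, claim, evidence, depth]))
    return rarity * quality
```
Treating quality as a multiplicative damping factor is one way to keep quality and originality as independent axes while preventing incoherent text from scoring as rare.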
Related papers
- Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays [15.895792302323883]
In educational contexts, teachers and learners require interpretable, trait-level feedback. We study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms. We show that explicitly modeling score ordinality substantially improves agreement with human raters.
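The summary does not say how ordinality is modeled; a minimal sketch of one standard reduction, cumulative binary thresholds over pre-extracted essay features, is shown below. The helper names and threshold scheme are assumptions, not the paper's method.
```python
# A minimal ordinal-regression sketch via cumulative binary targets
# P(score > j), assuming essay features X are already extracted and
# every threshold is crossed in the training data. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ordinal(X: np.ndarray, y: np.ndarray, n_levels: int):
    """One binary classifier per threshold j: predicts P(score > j)."""
    return [LogisticRegression(max_iter=1000).fit(X, (y > j).astype(int))
            for j in range(n_levels - 1)]

def predict_ordinal(models, X: np.ndarray) -> np.ndarray:
    """Summing the threshold probabilities recovers an ordinal score."""
    p_gt = np.stack([m.predict_proba(X)[:, 1] for m in models], axis=1)
    return np.rint(p_gt.sum(axis=1)).astype(int)  # scores in 0..n_levels-1
```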
arXiv Detail & Related papers (2026-02-04T14:30:52Z)
- ScholarPeer: A Context-Aware Multi-Agent Framework for Automated Peer Review [48.60540055009675]
ScholarPeer is a search-enabled multi-agent framework designed to emulate the cognitive processes of a senior researcher. We evaluate ScholarPeer on DeepReview-13K, and the results demonstrate that it achieves significant win-rates against state-of-the-art approaches in side-by-side evaluations.
arXiv Detail & Related papers (2026-01-30T06:54:55Z)
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
- Assessing the Reliability and Validity of Large Language Models for Automated Assessment of Student Essays in Higher Education [0.30158609733245967]
Five advanced Large Language Models (LLMs), Claude 3.5, DeepSeek v2, Gemini 2.5, GPT-4, and Mistral 24B, were investigated for automated essay scoring in a higher education context. A total of 67 Italian-language student essays were evaluated using a four-criterion rubric. Human-LLM agreement was consistently low and non-significant, and within-model reliability across replications was similarly weak.
arXiv Detail & Related papers (2025-08-04T14:02:12Z)
- How do Humans and Language Models Reason About Creativity? A Comparative Analysis [12.398832289718703]
We conducted two experiments examining how including example solutions with ratings impacts creativity evaluation. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality.
arXiv Detail & Related papers (2025-02-05T15:08:43Z)
- AI-generated Essays: Characteristics and Implications on Automated Scoring and Academic Integrity [13.371946973050845]
We examine and benchmark the characteristics and quality of essays generated by popular large language models (LLMs). Our findings highlight limitations in existing automated scoring systems, and identify areas for improvement. Despite concerns that the increasing variety of LLMs may undermine the feasibility of detecting AI-generated essays, our results show that detectors trained on essays generated from one model can often identify texts from others with high accuracy.
arXiv Detail & Related papers (2024-10-22T21:30:58Z)
- HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition [92.17397504834825]
HD-Eval is a framework that iteratively aligns large language model evaluators with human preferences.
HD-Eval inherits the essence of the evaluation mindset of human experts and enhances the alignment of LLM-based evaluators.
Extensive experiments on three evaluation domains demonstrate the superiority of HD-Eval in further aligning state-of-the-art evaluators.
arXiv Detail & Related papers (2024-02-24T08:01:32Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
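A minimal sketch of one way such distribution-level precision and recall can be estimated without aligned corpora, using k-NN manifold coverage in an embedding space; the embeddings, `k`, and helper names are illustrative assumptions, not necessarily the paper's construction.
```python
# A minimal sketch of distribution-level precision/recall via k-NN
# manifold coverage in an embedding space; the embeddings, k, and
# helper names are illustrative assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_radii(X: np.ndarray, k: int = 3) -> np.ndarray:
    """Distance from each point to its k-th neighbour (self excluded)."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return dists[:, -1]

def coverage(A: np.ndarray, B: np.ndarray, radii_B: np.ndarray) -> float:
    """Fraction of points in A lying inside some k-NN ball around B."""
    d, idx = NearestNeighbors(n_neighbors=1).fit(B).kneighbors(A)
    return float(np.mean(d[:, 0] <= radii_B[idx[:, 0]]))

# precision: generated samples covered by the real-text manifold;
# recall: real samples covered by the generated-text manifold.
# precision = coverage(gen_emb, real_emb, knn_radii(real_emb))
# recall    = coverage(real_emb, gen_emb, knn_radii(gen_emb))
```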
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA).
First, we extend the attribution source from unstructured texts to Knowledge Graphs (KGs), whose rich structure benefits both attribution performance and working scenarios.
Second, we propose a new "Conscious Incompetence" setting that accounts for an incomplete knowledge repository.
Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment.
arXiv Detail & Related papers (2023-10-09T11:45:59Z)
- Contextualizing Argument Quality Assessment with Relevant Knowledge [11.367297319588411]
SPARK is a novel method for scoring argument quality based on contextualization via relevant knowledge.
We devise four augmentations that leverage large language models to provide feedback, infer hidden assumptions, supply a similar-quality argument, or give a counter-argument.
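A minimal sketch of the four-augmentation idea, assuming a generic `llm(prompt) -> str` callable; the prompt wording and function names are illustrative, not SPARK's actual prompts.
```python
# A minimal sketch of the four augmentations, assuming a generic
# `llm(prompt) -> str` callable; prompt wording is illustrative,
# not SPARK's actual prompts.
from typing import Callable, Dict

AUGMENTATIONS = {
    "feedback": "Give brief feedback on this argument:\n{arg}",
    "assumptions": "List the hidden assumptions behind this argument:\n{arg}",
    "similar": "Write an argument of similar quality on the same topic:\n{arg}",
    "counter": "Write a counter-argument to this argument:\n{arg}",
}

def augment(argument: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Produce the four knowledge contexts that a downstream
    quality scorer would consume alongside the original argument."""
    return {name: llm(tpl.format(arg=argument))
            for name, tpl in AUGMENTATIONS.items()}
```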
arXiv Detail & Related papers (2023-05-20T21:04:58Z)
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU and ROUGE, cannot adequately capture the dimensions above.
We propose a new LLM-based framework that evaluates generated text against reference text from both objective and subjective aspects.
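A minimal sketch of role-played LLM judging in this spirit, assuming a generic `llm(prompt) -> str` callable; the roles, rubric, and 1-10 scale are illustrative assumptions rather than the paper's prompts.
```python
# A minimal sketch of role-played LLM judging, assuming a generic
# `llm(prompt) -> str` callable; the roles, rubric, and 1-10 scale
# are illustrative assumptions, not the paper's prompts.
import re
import statistics
from typing import Callable

ROLES = [
    "a strict reviewer of grammar and correctness",
    "a reader judging informativeness",
    "an editor judging succinctness and appeal",
]

def role_play_score(summary: str, reference: str,
                    llm: Callable[[str], str]) -> float:
    """Average the numeric ratings returned by several role-played judges."""
    scores = []
    for role in ROLES:
        prompt = (f"You are {role}. Compare the summary to the reference "
                  f"and rate the summary from 1 to 10.\n"
                  f"Reference:\n{reference}\n\nSummary:\n{summary}\n"
                  f"Answer with a single number.")
        match = re.search(r"\d+(?:\.\d+)?", llm(prompt))
        if match:
            scores.append(float(match.group()))
    return statistics.mean(scores) if scores else float("nan")
```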
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Measuring Association Between Labels and Free-Text Rationales [60.58672852655487]
In interpretable NLP, we require faithful rationales that reflect the model's decision-making process for an explained instance.
We demonstrate that pipeline models, the existing approach to faithful extractive rationalization on information-extraction style tasks, do not extend as reliably to "reasoning" tasks requiring free-text rationales.
We turn to models that jointly predict and rationalize, a class of widely used high-performance models for free-text rationalization whose faithfulness is not yet established.
arXiv Detail & Related papers (2020-10-24T03:40:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.