Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive
Investigation of Accuracy, Fairness, and Generalizability
- URL: http://arxiv.org/abs/2401.05655v1
- Date: Thu, 11 Jan 2024 04:28:02 GMT
- Title: Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive
Investigation of Accuracy, Fairness, and Generalizability
- Authors: Kaixun Yang, Mladen Raković, Yuyang Li, Quanlong Guan, Dragan
Gašević, Guanliang Chen
- Abstract summary: This study aims to uncover the intricate relationship between an AES model's accuracy, fairness, and generalizability.
We evaluate nine prominent AES methods using seven metrics on an open-source dataset.
- Score: 5.426458555881673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Essay Scoring (AES) is a well-established educational pursuit that
employs machine learning to evaluate student-authored essays. While much effort
has been made in this area, current research primarily focuses on either (i)
boosting the predictive accuracy of an AES model for a specific prompt (i.e.,
developing prompt-specific models), which often relies heavily on labeled
data from the same target prompt; or (ii) assessing the
applicability of AES models developed on non-target prompts to the intended
target prompt (i.e., developing the AES models in a cross-prompt setting).
Given the inherent bias in machine learning and its potential impact on
marginalized groups, it is imperative to investigate whether such bias exists
in current AES methods and, if identified, how it interacts with an AES
model's accuracy and generalizability. Thus, our study aimed to uncover the
intricate relationship between an AES model's accuracy, fairness, and
generalizability, contributing practical insights for developing effective AES
models in real-world education. To this end, we meticulously selected nine
prominent AES methods and evaluated their performance using seven metrics on an
open-source dataset, which contains over 25,000 essays and various demographic
information about students such as gender, English language learner status, and
economic status. Through extensive evaluations, we demonstrated that: (1)
prompt-specific models tend to outperform their cross-prompt counterparts in
terms of predictive accuracy; (2) prompt-specific models frequently exhibit
greater bias with respect to students' economic status than
cross-prompt models; (3) in the pursuit of generalizability, traditional
machine learning models coupled with carefully engineered features hold greater
potential for achieving both high accuracy and fairness than complex neural
network models.
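As a concrete illustration of the kind of evaluation the abstract describes, the sketch below scores essays with Quadratic Weighted Kappa (QWK), the de facto accuracy metric in AES research, and reports a per-group QWK gap as one crude fairness probe. This is a minimal sketch, not the authors' code or their seven metrics; the column names (score, pred, economic_status) and the toy data are hypothetical.

```python
# Minimal sketch (not the authors' code): QWK accuracy plus a per-group
# QWK gap as one simple fairness probe for an AES model's predictions.
import numpy as np
import pandas as pd

def quadratic_weighted_kappa(y_true, y_pred, min_score, max_score):
    """Cohen's kappa with quadratic disagreement weights over the score range."""
    ratings = np.arange(min_score, max_score + 1)
    n = len(ratings)
    # Quadratic weights: w[i, j] = (i - j)^2 / (n - 1)^2
    w = (ratings[:, None] - ratings[None, :]) ** 2 / (n - 1) ** 2
    # Observed co-occurrence matrix of (true, predicted) scores, normalized
    obs = np.zeros((n, n))
    for t, p in zip(y_true, y_pred):
        obs[t - min_score, p - min_score] += 1
    obs /= obs.sum()
    # Expected matrix from the outer product of the marginal distributions
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))
    return 1.0 - (w * obs).sum() / (w * exp).sum()

def qwk_fairness_gap(df, group_col, min_score, max_score):
    """Largest difference in per-group QWK: one crude group-fairness probe."""
    qwks = {
        g: quadratic_weighted_kappa(sub["score"].to_numpy(),
                                    sub["pred"].to_numpy(),
                                    min_score, max_score)
        for g, sub in df.groupby(group_col)
    }
    return max(qwks.values()) - min(qwks.values()), qwks

# Toy predictions for two hypothetical economic-status groups.
df = pd.DataFrame({
    "score": [1, 2, 3, 4, 2, 3, 1, 4],
    "pred":  [1, 2, 3, 3, 3, 3, 2, 4],
    "economic_status": ["low", "low", "low", "low",
                        "high", "high", "high", "high"],
})
gap, per_group = qwk_fairness_gap(df, "economic_status", 1, 4)
print(per_group, gap)
```

Analogous gaps can be taken over any demographic column (e.g., gender or English language learner status), which is the shape of comparison behind finding (2) above.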
Related papers
- From Efficiency to Equity: Measuring Fairness in Preference Learning [3.2132738637761027]
Inspired by economic theories of inequality and Rawlsian justice, we evaluate fairness in preference learning models.
We propose metrics adapted from the Gini Coefficient, Atkinson Index, and Kuznets Ratio to quantify fairness in these models (a minimal Gini-style sketch appears after this list).
arXiv Detail & Related papers (2024-10-24T15:25:56Z)
- FAIREDU: A Multiple Regression-Based Method for Enhancing Fairness in Machine Learning Models for Educational Applications [1.24497353837144]
This paper introduces FAIREDU, a novel and effective method designed to improve fairness across multiple sensitive features.
Through extensive experiments, we evaluate FAIREDU's effectiveness in enhancing fairness without compromising model performance.
The results demonstrate that FAIREDU addresses intersectionality across features such as gender, race, age, and other sensitive features, outperforming state-of-the-art methods with minimal effect on model accuracy.
arXiv Detail & Related papers (2024-10-08T23:29:24Z)
- Phrase-Level Adversarial Training for Mitigating Bias in Neural Network-based Automatic Essay Scoring [0.0]
We propose a model-agnostic phrase-level method to generate an adversarial essay set to address the biases and robustness of AES models.
Experimental results show that the proposed approach significantly improves AES model performance in the presence of adversarial examples and scenarios.
arXiv Detail & Related papers (2024-09-07T11:22:35Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under a further elaborated robustness metric, a model is judged robust only if its performance is consistently accurate across the cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- On the Robustness of Aspect-based Sentiment Analysis: Rethinking Model, Data, and Training [109.9218185711916]
Aspect-based sentiment analysis (ABSA) aims at automatically inferring the specific sentiment polarities toward certain aspects of products or services behind social media texts or reviews.
We propose to enhance the ABSA robustness by systematically rethinking the bottlenecks from all possible angles, including model, data, and training.
arXiv Detail & Related papers (2023-04-19T11:07:43Z)
- End-to-End Speech Recognition: A Survey [68.35707678386949]
The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
arXiv Detail & Related papers (2023-03-03T01:46:41Z)
- Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Do we need to go Deep? Knowledge Tracing with Big Data [5.218882272051637]
We use EdNet, the largest student interaction dataset publicly available in the education domain, to understand how accurately both deep and traditional models predict future student performances.
Through extensive experimentation, we observe that logistic regression models with carefully engineered features outperform deep models.
arXiv Detail & Related papers (2021-01-20T22:40:38Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the scores produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
- Predicting Engagement in Video Lectures [24.415345855402624]
We introduce a novel, large dataset of video lectures for predicting context-agnostic engagement.
We propose both cross-modal and modality-specific feature sets to achieve this task.
We demonstrate the use of our approach in the case of data scarcity.
arXiv Detail & Related papers (2020-05-31T19:28:16Z)
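The "From Efficiency to Equity" entry above adapts classical inequality measures into fairness metrics. Below is a minimal sketch of the classic Gini coefficient over a hypothetical vector of per-group model utilities; it illustrates the general idea only and is not that paper's exact adapted metric (nor its Atkinson Index or Kuznets Ratio variants).

```python
# Minimal sketch: classic Gini coefficient over non-negative values,
# one way an economic inequality measure can serve as a fairness metric.
import numpy as np

def gini_coefficient(values):
    """Gini of non-negative values: 0 = perfect equality, -> 1 = inequality."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    if n == 0 or v.sum() == 0:
        return 0.0
    # Sorted-rank identity: G = 2 * sum(i * v_i) / (n * sum(v)) - (n + 1) / n
    ranks = np.arange(1, n + 1)
    return 2 * (ranks * v).sum() / (n * v.sum()) - (n + 1) / n

# Hypothetical per-group utilities: equal vs. skewed allocations.
print(gini_coefficient([0.8, 0.8, 0.8, 0.8]))  # ~0.0 (perfectly equal)
print(gini_coefficient([0.1, 0.2, 0.3, 0.9]))  # noticeably higher
```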