Artificial Intelligence Bias on English Language Learners in Automatic Scoring
- URL: http://arxiv.org/abs/2505.10643v2
- Date: Mon, 19 May 2025 21:42:42 GMT
- Title: Artificial Intelligence Bias on English Language Learners in Automatic Scoring
- Authors: Shuchen Guo, Yun Wang, Jichao Yu, Xuansheng Wu, Bilgehan Ayik, Field M. Watts, Ehsan Latif, Ninghao Liu, Lei Liu, Xiaoming Zhai
- Abstract summary: We fine-tuned BERT with four datasets: responses from (1) ELLs, (2) non-ELLs, (3) a mixed dataset reflecting the real-world proportion of ELLs and non-ELLs, and (4) a balanced mixed dataset with equal representation of both groups. We measured the Mean Score Gaps (MSGs) between ELLs and non-ELLs and then calculated the differences in MSGs generated by the human raters and the AI models to identify scoring disparities. We found no AI bias or distorted disparities between ELLs and non-ELLs when the training dataset was large enough.
- Score: 23.76046619016318
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study investigated potential scoring biases and disparities toward English Language Learners (ELLs) when using automatic scoring systems for middle school students' written responses to science assessments. We specifically focus on examining how unbalanced training data with ELLs contributes to scoring bias and disparities. We fine-tuned BERT with four datasets: responses from (1) ELLs, (2) non-ELLs, (3) a mixed dataset reflecting the real-world proportion of ELLs and non-ELLs (unbalanced), and (4) a balanced mixed dataset with equal representation of both groups. The study analyzed 21 assessment items: 10 items with about 30,000 ELL responses, five items with about 1,000 ELL responses, and six items with about 200 ELL responses. Scoring accuracy (Acc) was calculated and compared to identify bias using Friedman tests. We measured the Mean Score Gaps (MSGs) between ELLs and non-ELLs and then calculated the differences in MSGs generated by the human raters and the AI models to identify scoring disparities. We found no AI bias or distorted disparities between ELLs and non-ELLs when the training dataset was large enough (ELL = 30,000 and ELL = 1,000), but concerns could arise when the sample size is limited (ELL = 200).
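Read as a procedure, the disparity measure above is the difference between the AI model's Mean Score Gap and the human raters' Mean Score Gap, with Friedman tests comparing scoring accuracy across training conditions. The sketch below is not the authors' code; it uses made-up score arrays and accuracy values purely to illustrate the computation, with scipy's friedmanchisquare standing in for the bias test across the four training conditions.

import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical human and AI scores for ELL and non-ELL responses.
human_ell     = np.array([2, 1, 3, 2, 1])
human_non_ell = np.array([3, 2, 3, 3, 2])
ai_ell        = np.array([2, 2, 3, 1, 1])
ai_non_ell    = np.array([3, 2, 3, 3, 3])

msg_human = human_non_ell.mean() - human_ell.mean()   # human Mean Score Gap
msg_ai    = ai_non_ell.mean() - ai_ell.mean()         # AI Mean Score Gap
disparity = msg_ai - msg_human                        # distortion introduced by the AI scorer
print(f"MSG(human)={msg_human:.2f}  MSG(AI)={msg_ai:.2f}  disparity={disparity:+.2f}")

# Friedman test on scoring accuracy across the four training conditions
# (ELL-only, non-ELL-only, unbalanced mix, balanced mix); one made-up
# accuracy value per assessment item.
acc_ell_only   = [0.81, 0.78, 0.84, 0.80]
acc_non_ell    = [0.83, 0.79, 0.85, 0.82]
acc_unbalanced = [0.82, 0.80, 0.86, 0.81]
acc_balanced   = [0.84, 0.81, 0.86, 0.83]
stat, p = friedmanchisquare(acc_ell_only, acc_non_ell, acc_unbalanced, acc_balanced)
print(f"Friedman chi-square={stat:.3f}, p={p:.3f}")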
Related papers
- ChatGPT for automated grading of short answer questions in mechanical ventilation [0.0]
Large language models (LLMs) simulate conversational language and interpret unstructured free-text responses. We evaluated ChatGPT 4o to grade SAQs in a postgraduate medical setting using data from 215 students.
arXiv Detail & Related papers (2025-05-05T19:04:25Z)
- Everyone deserves their voice to be heard: Analyzing Predictive Gender Bias in ASR Models Applied to Dutch Speech Data [13.91630413828167]
This study focuses on identifying the performance disparities of Whisper models on Dutch speech data.
We analyzed the word error rate, character error rate and a BERT-based semantic similarity across gender groups.
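As a concrete illustration of such a per-group comparison (not code from the paper), the sketch below computes WER and CER separately for two gender groups over a handful of invented Dutch transcript/hypothesis pairs, using jiwer as one common implementation of these metrics.

# Hypothetical per-group WER/CER comparison; transcripts and labels are made up.
from jiwer import wer, cer

samples = [
    # (group, reference transcript, ASR hypothesis)
    ("female", "de kat zit op de mat", "de kat zat op de mat"),
    ("male",   "het regent vandaag hard", "het regent vandaag hard"),
    ("female", "ik ga morgen naar school", "ik ga morgen naar de school"),
    ("male",   "wij eten vanavond samen", "wij eten vanavond samen"),
]

for group in ("female", "male"):
    refs = [r for g, r, h in samples if g == group]
    hyps = [h for g, r, h in samples if g == group]
    print(group, "WER:", round(wer(refs, hyps), 3), "CER:", round(cer(refs, hyps), 3))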
arXiv Detail & Related papers (2024-11-14T13:29:09Z)
- AI Gender Bias, Disparities, and Fairness: Does Training Data Matter? [3.509963616428399]
This study delves into the pervasive issue of gender bias in artificial intelligence (AI). It analyzes more than 1,000 human-graded student responses from male and female participants across six assessment items. Results indicate that scoring accuracy for mixed-trained models shows an insignificant difference from either male- or female-trained models.
arXiv Detail & Related papers (2023-12-17T22:37:06Z)
- Assessing Student Errors in Experimentation Using Artificial Intelligence and Large Language Models: A Comparative Study with Human Raters [9.899633398596672]
We investigate the potential of Large Language Models (LLMs) for automatically identifying student errors.
An AI system based on the GPT-3.5 and GPT-4 series was developed and tested against human raters.
Our results indicate varying levels of accuracy in error detection between the AI system and human raters.
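One common way to quantify that AI-versus-human agreement is chance-corrected agreement such as Cohen's kappa; the sketch below uses invented error labels and scikit-learn, and is not the evaluation protocol of the paper itself.

# Hypothetical agreement check between AI-detected and human-rated student
# errors (1 = error present, 0 = no error); all labels are invented.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
ai_labels    = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

print("raw agreement:", accuracy_score(human_labels, ai_labels))
print("Cohen's kappa:", round(cohen_kappa_score(human_labels, ai_labels), 3))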
arXiv Detail & Related papers (2023-08-11T12:03:12Z)
- D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect
Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs) to detect social biases.
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
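For a rough sense of what a few-shot instruction prompt for bias detection can look like, here is a hypothetical template; the instruction wording and the example statements are invented and do not come from the paper.

# Hypothetical few-shot instruction prompt; content is illustrative only.
FEW_SHOT_PROMPT = """\
Instruction: Decide whether the statement expresses a social bias. Answer "biased" or "not biased".

Statement: "People from that neighborhood are always causing trouble."
Answer: biased

Statement: "The library closes at 9 pm on weekdays."
Answer: not biased

Statement: "{statement}"
Answer:"""

def build_prompt(statement: str) -> str:
    """Fill the few-shot template with the statement to classify."""
    return FEW_SHOT_PROMPT.format(statement=statement)

print(build_prompt("Older employees cannot learn new software."))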
arXiv Detail & Related papers (2021-12-15T04:19:52Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Fairness in Cardiac MR Image Analysis: An Investigation of Bias Due to Data Imbalance in Deep Learning Based Segmentation [1.6386696247541932]
"Fairness" in AI refers to assessing algorithms for potential bias based on demographic characteristics such as race and gender.
Deep learning (DL) in cardiac MR segmentation has led to impressive results in recent years, but no work has yet investigated the fairness of such models.
We find statistically significant differences in Dice performance between different racial groups.
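The Dice comparison behind that finding can be illustrated with a toy per-group computation; the masks below are tiny invented binary arrays, not cardiac MR segmentations, and the grouping is hypothetical.

# Toy per-group Dice comparison on invented binary masks.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|) for binary masks."""
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

# (prediction, ground truth) pairs for two hypothetical demographic groups.
groups = {
    "group_A": [(np.array([1, 1, 0, 0]), np.array([1, 1, 1, 0])),
                (np.array([0, 1, 1, 0]), np.array([0, 1, 1, 0]))],
    "group_B": [(np.array([1, 0, 0, 0]), np.array([1, 1, 0, 0])),
                (np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]))],
}

for name, pairs in groups.items():
    scores = [dice(p, t) for p, t in pairs]
    print(name, "mean Dice:", round(float(np.mean(scores)), 3))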
arXiv Detail & Related papers (2021-06-23T13:27:35Z)
- What Can We Learn from Collective Human Opinions on Natural Language Inference Data? [88.90490998032429]
ChaosNLI is a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS.
This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI.
arXiv Detail & Related papers (2020-10-07T17:26:06Z)
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
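The general recipe, clustering examples and then comparing model performance within each cluster against the overall average, can be sketched as follows; this is not LOGAN's actual algorithm, and the embeddings, labels, and predictions are randomly generated stand-ins.

# Rough illustration of clustering-based local bias analysis (not LOGAN itself).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 16))                        # hypothetical text embeddings
y_true = rng.integers(0, 2, size=200)                          # hypothetical gold labels
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)   # noisy model predictions

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)
overall_acc = (y_true == y_pred).mean()
for c in range(4):
    mask = clusters == c
    local_acc = (y_true[mask] == y_pred[mask]).mean()
    print(f"cluster {c}: n={mask.sum():3d}  accuracy={local_acc:.2f}  gap={local_acc - overall_acc:+.2f}")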
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
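That kind of overstability check can be expressed as a simple before/after comparison; in the sketch below, score_essay is a toy stand-in scorer (a real test would query the AES model under evaluation), and the essay and off-topic text are invented.

# Hypothetical overstability check with a toy stand-in scorer.
def score_essay(essay: str) -> float:
    """Placeholder scorer: rewards overlap with a few topic keywords."""
    keywords = {"photosynthesis", "sunlight", "chlorophyll", "energy"}
    words = [w.strip(".,").lower() for w in essay.split()]
    return sum(w in keywords for w in words) / max(len(words), 1)

original = "Photosynthesis converts sunlight into chemical energy using chlorophyll."
off_topic = "My favorite football team won the championship last weekend."
modified = original + " " + off_topic   # add content unrelated to the question topic

print("original score:", round(score_essay(original), 3))
print("modified score:", round(score_essay(modified), 3))
# An overstable AES model would return nearly the same score for both versions.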
arXiv Detail & Related papers (2020-07-14T03:49:43Z)