BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation
- URL: http://arxiv.org/abs/2602.23580v1
- Date: Fri, 27 Feb 2026 01:11:05 GMT
- Title: BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation
- Authors: Yun Wang, Xuansheng Wu, Jingyuan Huang, Lei Liu, Xiaoming Zhai, Ninghao Liu
- Abstract summary: We propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework for low-resource assessment settings. We show that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance.
- Score: 33.11188827947722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict ELL students even when they demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by "pasting" construct-relevant (i.e., rubric-aligned knowledge and evidence) content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on the California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
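The inter-group augmentation loop described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the callables `generate_ell_variant` (e.g., an LLM prompted to re-express a donor response's rubric-aligned content in the linguistic style of a reference ELL response) and `keep_sample` (the discriminator filter) are assumptions introduced here for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch of a BRIDGE-style inter-group data generation loop.
# The helper callables are placeholders; the paper does not specify these APIs.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Response:
    text: str       # student's constructed response
    score: int      # rubric score assigned by human raters
    is_ell: bool    # English Language Learner flag


def bridge_augment(
    high_nonell: List[Response],        # abundant high-scoring non-ELL responses (content donors)
    ell_references: List[Response],     # authentic ELL responses used as linguistic-style references
    generate_ell_variant: Callable[[str, str], str],   # (donor_text, style_reference_text) -> synthetic text
    keep_sample: Callable[[str], bool],                # discriminator: accept only high-quality synthetic samples
) -> List[Response]:
    """Synthesize high-scoring ELL samples by pasting construct-relevant content
    from non-ELL responses into ELL linguistic patterns, then filter candidates
    with a discriminator before adding them to the training pool."""
    synthetic: List[Response] = []
    for donor in high_nonell:
        for style_ref in ell_references:
            candidate = generate_ell_variant(donor.text, style_ref.text)
            if keep_sample(candidate):
                # The donor's rubric score is carried over, since the construct-relevant
                # content is preserved; only the linguistic surface form changes.
                synthetic.append(Response(text=candidate, score=donor.score, is_ell=True))
    return synthetic
```

Under these assumptions, the accepted synthetic responses would be pooled with the original training data so that the scorer sees high-scoring answers expressed in ELL linguistic patterns, rather than relying only on the few real minority samples.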
Related papers
- A Comprehensive Study of Implicit and Explicit Biases in Large Language Models [1.0555164678638427]
This study highlights the need to address biases in Large Language Models amid the growing use of generative AI. We studied bias-specific benchmarks such as StereoSet and CrowS-Pairs to evaluate the existence of various biases in multiple generative models such as BERT and GPT-3.5. Results indicated that fine-tuned models struggled with gender biases but excelled at identifying and avoiding racial biases.
arXiv Detail & Related papers (2025-11-18T05:27:17Z) - Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation [12.56588481992456]
Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior. We introduce a novel and general augmentation framework that involves three plug-and-play steps. We find that Large Language Models are susceptible to perturbations of their inputs, showing a higher likelihood of behaving stereotypically.
arXiv Detail & Related papers (2025-10-27T23:05:12Z) - Detecting Prefix Bias in LLM-based Reward Models [4.596249232904721]
We introduce novel methods to detect and evaluate prefix bias in reward models trained on preference datasets. We leverage these metrics to reveal significant biases in preference models across racial and gender dimensions. Our findings highlight the critical need for bias-aware dataset design and evaluation in developing fair and reliable reward models.
arXiv Detail & Related papers (2025-05-13T21:50:03Z) - Improving Group Fairness in Knowledge Distillation via Laplace Approximation of Early Exits [0.0]
We propose that leveraging Laplace approximation-based methods to obtain well-calibrated uncertainty estimates can also effectively reweight challenging instances. To validate our claims, we benchmark our approach using a BERT-based model on the MultiNLI dataset.
arXiv Detail & Related papers (2025-05-02T07:18:52Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Non-Invasive Fairness in Learning through the Lens of Data Drift [88.37640805363317]
We show how to improve the fairness of Machine Learning models without altering the data or the learning algorithm.
We use a simple but key insight: the divergence of trends between different populations, and, consequently, between a learned model and minority populations, is analogous to data drift.
We explore two strategies (model-splitting and reweighing) to resolve this drift, aiming to improve the overall conformance of models to the underlying data.
arXiv Detail & Related papers (2023-03-30T17:30:42Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - Debiasing Vision-Language Models via Biased Prompts [79.04467131711775]
We propose a general approach for debiasing vision-language foundation models by projecting out biased directions in the text embedding.
We show that debiasing only the text embedding with a calibrated projection matrix suffices to yield robust classifiers and fair generative models.
arXiv Detail & Related papers (2023-01-31T20:09:33Z) - General Greedy De-bias Learning [163.65789778416172]
We propose a General Greedy De-bias learning framework (GGD), which greedily trains the biased models and the base model like gradient descent in functional space.
GGD can learn a more robust base model under the settings of both task-specific biased models with prior knowledge and self-ensemble biased model without prior knowledge.
arXiv Detail & Related papers (2021-12-20T14:47:32Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs)
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)