Bayesian Active Learning for Multi-Criteria Comparative Judgement in Educational Assessment
- URL: http://arxiv.org/abs/2503.00479v2
- Date: Sun, 23 Mar 2025 20:08:21 GMT
- Title: Bayesian Active Learning for Multi-Criteria Comparative Judgement in Educational Assessment
- Authors: Andy Gray, Alma Rahat, Tom Crick, Stephen Lindsay
- Abstract summary: Comparative Judgement (CJ) provides an alternative assessment approach by evaluating work holistically rather than breaking it into discrete criteria. This method leverages human ability to make nuanced comparisons, yielding more reliable and valid assessments. However, rubrics remain widely used in education, offering structured criteria for grading and detailed feedback. This creates a gap between CJ's holistic ranking and the need for criterion-based performance breakdowns.
- Score: 3.0098452499209705
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Comparative Judgement (CJ) provides an alternative assessment approach by evaluating work holistically rather than breaking it into discrete criteria. This method leverages human ability to make nuanced comparisons, yielding more reliable and valid assessments. CJ aligns with real-world evaluations, where overall quality emerges from the interplay of various elements. However, rubrics remain widely used in education, offering structured criteria for grading and detailed feedback. This creates a gap between CJ's holistic ranking and the need for criterion-based performance breakdowns. This paper addresses this gap using a Bayesian approach. We build on Bayesian CJ (BCJ) by Gray et al., which directly models preferences instead of using likelihoods over total scores, allowing for expected ranks with uncertainty estimation. Their entropy-based active learning method selects the most informative pairwise comparisons for assessors. We extend BCJ to handle multiple independent learning outcome (LO) components, defined by a rubric, enabling both holistic and component-wise predictive rankings with uncertainty estimates. Additionally, we propose a method to aggregate entropies and identify the most informative comparison for assessors. Experiments on synthetic and real data demonstrate our method's effectiveness. Finally, we address a key limitation of BCJ, which is the inability to quantify assessor agreement. We show how to derive agreement levels, enhancing transparency in assessment.
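The abstract sketches the mechanics at a high level: a preference model per learning outcome (LO), entropy-based selection of the next pairwise comparison, and aggregation of per-component entropies into a single acquisition score. Below is a minimal sketch of that workflow. It assumes a Beta-Bernoulli posterior per pair and per LO, a plain sum as the entropy aggregation rule, and Monte Carlo sampling for expected ranks; these are illustrative assumptions, not the paper's confirmed design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration, not taken from the paper).
n_items, n_los = 6, 3

# One Beta(alpha, beta) posterior per learning outcome (LO) and ordered pair (i, j):
# alpha[l, i, j] counts "i preferred over j on LO l"; beta counts the reverse.
alpha = np.ones((n_los, n_items, n_items))
beta = np.ones((n_los, n_items, n_items))


def pair_entropy(a, b):
    """Entropy of the posterior-mean Bernoulli outcome for each pair."""
    p = np.clip(a / (a + b), 1e-9, 1 - 1e-9)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def most_informative_pair():
    """Aggregate per-LO entropies (a plain sum here, which is an assumption)
    and return the unordered pair with the highest total entropy."""
    h = pair_entropy(alpha, beta).sum(axis=0)   # aggregate over LOs
    h = np.triu(h, k=1)                         # each unordered pair once
    return np.unravel_index(np.argmax(h), h.shape)


def record_judgement(lo, winner, loser):
    """Update the Beta posterior for one LO after an assessor's decision."""
    alpha[lo, winner, loser] += 1
    beta[lo, loser, winner] += 1


def expected_ranks(lo, n_samples=2000):
    """Monte Carlo expected ranks (mean and spread) for one LO."""
    ranks = np.empty((n_samples, n_items))
    for s in range(n_samples):
        p = rng.beta(alpha[lo], beta[lo])         # sample win probabilities
        wins = (p > p.T).sum(axis=1)              # pairwise win counts
        ranks[s] = np.argsort(np.argsort(-wins))  # rank 0 = best
    return ranks.mean(axis=0), ranks.std(axis=0)


# Simulated assessment loop: ask for the most informative comparison,
# record a (random) judgement on each LO, then report component-wise ranks.
for _ in range(20):
    i, j = most_informative_pair()
    for lo in range(n_los):
        winner, loser = (i, j) if rng.random() < 0.5 else (j, i)
        record_judgement(lo, winner, loser)

for lo in range(n_los):
    mean_rank, rank_std = expected_ranks(lo)
    print(f"LO {lo}: expected ranks {np.round(mean_rank, 2)} ± {np.round(rank_std, 2)}")
```

Swapping the sum in `most_informative_pair` for a different aggregation rule changes which comparison is presented next; choosing that rule is the design question the paper's entropy-aggregation method addresses.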
Related papers
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol can significantly affect evaluation reliability and induce systematic biases.
In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation.
arXiv Detail & Related papers (2025-04-20T19:05:59Z)
- Rendering Transparency to Ranking in Educational Assessment via Bayesian Comparative Judgement [2.8054775602970743]
This paper examines how Bayesian Comparative Judgement (BCJ) enhances transparency by integrating prior information into the judgement process.
BCJ assigns probabilities to judgement outcomes, offering quantifiable measures of uncertainty and deeper insights into decision confidence.
We highlight the benefits and limitations of BCJ, offering insights into its real-world application across various educational settings.
arXiv Detail & Related papers (2025-03-17T20:56:55Z)
- Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge [90.8674158031845]
We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses.
This process effectively guides LLM-as-a-Judge to provide a more detailed chain-of-thought (CoT) judgment.
Our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling.
arXiv Detail & Related papers (2025-02-18T03:31:06Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z)
- Evaluating Agents using Social Choice Theory [20.58298173034909]
We argue that many general evaluation problems can be viewed through the lens of voting theory.
Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation.
These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation.
arXiv Detail & Related papers (2023-12-05T20:40:37Z)
- A Bayesian Active Learning Approach to Comparative Judgement [3.0098452499209705]
Traditional marking is a source of inconsistencies and unconscious bias, placing a high cognitive load on the assessor.
In CJ, the assessor is presented with a pair of items and is asked to select the better one.
While CJ is considered a reliable method for marking, there are concerns around transparency.
We propose a novel Bayesian approach to CJ (BCJ) for determining the ranks of compared items.
arXiv Detail & Related papers (2023-08-25T10:33:44Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model-agnostic post-processing framework for balancing ranking fairness and algorithm utility in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.