Rendering Transparency to Ranking in Educational Assessment via Bayesian Comparative Judgement
- URL: http://arxiv.org/abs/2503.15549v1
- Date: Mon, 17 Mar 2025 20:56:55 GMT
- Title: Rendering Transparency to Ranking in Educational Assessment via Bayesian Comparative Judgement
- Authors: Andy Gray, Alma Rahat, Stephen Lindsay, Jen Pearson, Tom Crick,
- Abstract summary: This paper examines how Bayesian Comparative Judgement (BCJ) enhances transparency by integrating prior information into the judgement process. BCJ assigns probabilities to judgement outcomes, offering quantifiable measures of uncertainty and deeper insights into decision confidence. We highlight the benefits and limitations of BCJ, offering insights into its real-world application across various educational settings.
- Score: 2.8054775602970743
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Ensuring transparency in educational assessment is increasingly critical, particularly post-pandemic, as demand grows for fairer and more reliable evaluation methods. Comparative Judgement (CJ) offers a promising alternative to traditional assessments, yet concerns remain about its perceived opacity. This paper examines how Bayesian Comparative Judgement (BCJ) enhances transparency by integrating prior information into the judgement process, providing a structured, data-driven approach that improves interpretability and accountability. BCJ assigns probabilities to judgement outcomes, offering quantifiable measures of uncertainty and deeper insights into decision confidence. By systematically tracking how prior data and successive judgements inform final rankings, BCJ clarifies the assessment process and helps identify assessor disagreements. Multi-criteria BCJ extends this by evaluating multiple learning outcomes (LOs) independently, preserving the richness of CJ while producing transparent, granular rankings aligned with specific assessment goals. It also enables a holistic ranking derived from individual LOs, ensuring comprehensive evaluations without compromising detailed feedback. Using a real higher education dataset with professional markers in the UK, we demonstrate BCJ's quantitative rigour and ability to clarify ranking rationales. Through qualitative analysis and discussions with experienced CJ practitioners, we explore its effectiveness in contexts where transparency is crucial, such as high-stakes national assessments. We highlight the benefits and limitations of BCJ, offering insights into its real-world application across various educational settings.
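To make the probabilistic ranking described above concrete, here is a minimal sketch of one way pairwise judgements could be modelled: a Beta-Bernoulli posterior over each pairwise outcome, with items ranked by their expected win rate against all others. The function name `bcj_rank`, the uniform Beta(1, 1) default prior, and the row-sum ranking rule are assumptions made for illustration, not the authors' implementation; in the paper, priors may encode earlier assessment data, and the multi-criteria variant would keep one such model per learning outcome (LO) and combine the per-LO scores into a holistic ranking.

```python
import numpy as np

def bcj_rank(n_items, judgements, alpha0=1.0, beta0=1.0):
    """Toy Bayesian comparative judgement sketch (illustrative only).

    `judgements` is a list of (winner, loser) index pairs. Each unordered
    pair (i, j) with i < j gets a Beta(alpha0, beta0) prior over the
    probability that item i beats item j; in practice the prior could
    encode information from earlier assessments.
    """
    alpha = np.full((n_items, n_items), alpha0)
    beta = np.full((n_items, n_items), beta0)

    for winner, loser in judgements:
        i, j = sorted((winner, loser))
        if winner == i:      # lower-indexed item won this comparison
            alpha[i, j] += 1
        else:                # higher-indexed item won
            beta[i, j] += 1

    # Posterior mean probability that item i beats item j.
    p = np.zeros((n_items, n_items))
    for i in range(n_items):
        for j in range(i + 1, n_items):
            p[i, j] = alpha[i, j] / (alpha[i, j] + beta[i, j])
            p[j, i] = 1.0 - p[i, j]

    # Rank items by their expected win rate against every other item.
    scores = p.sum(axis=1) / (n_items - 1)
    ranking = np.argsort(-scores)
    return ranking, scores

# Example: four scripts, five pairwise judgements from assessors.
ranking, scores = bcj_rank(4, [(0, 1), (0, 2), (1, 2), (3, 2), (0, 3)])
print(ranking, scores.round(2))
```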
Related papers
- Bayesian Active Learning for Multi-Criteria Comparative Judgement in Educational Assessment [3.0098452499209705]
Comparative Judgement (CJ) provides an alternative assessment approach by evaluating work holistically rather than breaking it into discrete criteria. This method leverages human ability to make nuanced comparisons, yielding more reliable and valid assessments. However, rubrics remain widely used in education, offering structured criteria for grading and detailed feedback. This creates a gap between CJ's holistic ranking and the need for criterion-based performance breakdowns.
arXiv Detail & Related papers (2025-03-01T13:12:41Z)
- Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge [90.8674158031845]
We propose Crowd-based Comparative Evaluation, which introduces additional crowd responses to compare with the candidate responses. This process effectively guides LLM-as-a-Judge to provide a more detailed chain-of-thought (CoT) judgment. Our method produces higher-quality CoTs that facilitate judge distillation and exhibit superior performance in rejection sampling.
arXiv Detail & Related papers (2025-02-18T03:31:06Z)
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF). In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination. We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z)
- CARMO: Dynamic Criteria Generation for Context-Aware Reward Modelling [27.86204841898399]
Reward modeling in large language models is susceptible to reward hacking. We propose Context-Aware Reward Modeling (CARMO) to mitigate this problem. We establish a new state-of-the-art performance in zero-shot settings for generative models, achieving a 2.1% improvement on Reward Bench.
arXiv Detail & Related papers (2024-10-28T21:18:49Z)
- Pessimistic Evaluation [58.736490198613154]
We argue that evaluating information access systems assumes utilitarian values not aligned with traditions of information access based on equal access.
We advocate for pessimistic evaluation of information access systems focusing on worst case utility.
arXiv Detail & Related papers (2024-10-17T15:40:09Z)
- Multi-Facet Counterfactual Learning for Content Quality Evaluation [48.73583736357489]
We propose a framework for efficiently constructing evaluators that perceive multiple facets of content quality evaluation.
We leverage a joint training strategy based on contrastive learning and supervised learning to enable the evaluator to distinguish between different quality facets.
arXiv Detail & Related papers (2024-10-10T08:04:10Z)
- How Reliable are LLMs as Knowledge Bases? Re-thinking Facutality and Consistency [60.25969380388974]
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs). Current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. We propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score.
arXiv Detail & Related papers (2024-07-18T15:20:18Z)
- Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset [2.7463268699570134]
This study delves into the realm of explainability and fairness in Legal Judgement Prediction (LJP) models.
We evaluate the explainability performance of state-of-the-art monolingual and multilingual BERT-based LJP models.
We introduce a novel evaluation framework, Lower Court Insertion (LCI), which allows us to quantify the influence of lower court information on model predictions.
arXiv Detail & Related papers (2024-02-26T20:42:40Z)
- Unveiling Bias in Fairness Evaluations of Large Language Models: A Critical Literature Review of Music and Movie Recommendation Systems [0.0]
The rise of generative artificial intelligence, particularly Large Language Models (LLMs), has intensified the imperative to scrutinize fairness alongside accuracy.
Recent studies have begun to investigate fairness evaluations for LLMs within domains such as recommendations.
Yet, the degree to which current fairness evaluation frameworks account for personalization remains unclear.
arXiv Detail & Related papers (2024-01-08T17:57:29Z)
- A Bayesian Active Learning Approach to Comparative Judgement [3.0098452499209705]
Traditional marking is a source of inconsistencies and unconscious bias, placing a high cognitive load on the assessor.
In CJ, the assessor is presented with a pair of items and is asked to select the better one.
While CJ is considered a reliable method for marking, there are concerns around transparency.
We propose a novel Bayesian approach to CJ (BCJ) for determining the ranks of compared items.
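Building on the Beta-Bernoulli sketch given after the main abstract, the snippet below illustrates how such a posterior could drive active pair selection: asking assessors about the comparison whose outcome is currently most uncertain. The acquisition rule used here (distance of the posterior mean from 0.5) is an assumption chosen for clarity, not necessarily the criterion used in the cited paper.

```python
import numpy as np

def most_uncertain_pair(alpha, beta):
    """Suggest the next pair to show an assessor.

    `alpha` and `beta` are the upper-triangular Beta parameters from the
    bcj_rank sketch above. We pick the pair whose posterior mean win
    probability is closest to 0.5, i.e. the most uncertain comparison.
    Illustrative acquisition rule only; other information-theoretic
    criteria (e.g. expected entropy reduction) could be used instead.
    """
    n = alpha.shape[0]
    best_pair, best_gap = None, np.inf
    for i in range(n):
        for j in range(i + 1, n):
            p = alpha[i, j] / (alpha[i, j] + beta[i, j])
            gap = abs(p - 0.5)
            if gap < best_gap:
                best_pair, best_gap = (i, j), gap
    return best_pair
```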
arXiv Detail & Related papers (2023-08-25T10:33:44Z)
- Non-Comparative Fairness for Human-Auditing and Its Relation to Traditional Fairness Notions [1.8275108630751837]
This paper proposes a new fairness notion based on the principle of non-comparative justice.
We show that any machine learning system (MLS) can be deemed fair from the perspective of comparative fairness.
We also show that the converse holds true in the context of individual fairness.
arXiv Detail & Related papers (2021-06-29T20:05:22Z)
- Interpretable Off-Policy Evaluation in Reinforcement Learning by Highlighting Influential Transitions [48.91284724066349]
Off-policy evaluation in reinforcement learning offers the chance of using observational data to improve future outcomes in domains such as healthcare and education.
Traditional measures such as confidence intervals may be insufficient due to noise, limited data and confounding.
We develop a method that could serve as a hybrid human-AI system, to enable human experts to analyze the validity of policy evaluation estimates.
arXiv Detail & Related papers (2020-02-10T00:26:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.