Related papers: mucAI at BAREC Shared Task 2025: Towards Uncertainty Aware Arabic Readability Assessment

URL: http://arxiv.org/abs/2509.15485v1
Date: Thu, 18 Sep 2025 23:14:51 GMT
Title: mucAI at BAREC Shared Task 2025: Towards Uncertainty Aware Arabic Readability Assessment
Authors: Ahmed Abdou
Abstract summary: We present a model-agnostic technique for fine-grained Arabic readability classification in the BAREC 2025 Shared Task. Our method applies conformal prediction to generate prediction sets with coverage guarantees, then computes weighted averages using softmax-renormalized probabilities over the conformal sets. This uncertainty-aware decoding improves Quadratic Weighted Kappa (QWK) by reducing high-penalty misclassifications to nearer levels.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a simple, model-agnostic post-processing technique for fine-grained Arabic readability classification in the BAREC 2025 Shared Task (19 ordinal levels). Our method applies conformal prediction to generate prediction sets with coverage guarantees, then computes weighted averages using softmax-renormalized probabilities over the conformal sets. This uncertainty-aware decoding improves Quadratic Weighted Kappa (QWK) by reducing high-penalty misclassifications to nearer levels. Our approach shows consistent QWK improvements of 1-3 points across different base models. In the strict track, our submission achieves QWK scores of 84.9% (test) and 85.7% (blind test) for sentence level, and 73.3% for document level. For Arabic educational assessment, this enables human reviewers to focus on a handful of plausible levels, combining statistical guarantees with practical usability.
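
The decoding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes split conformal prediction with the nonconformity score 1 - p(true level), reads "softmax-renormalized" as renormalizing the model's softmax probabilities over the levels kept in the set, and rounds the weighted mean to the nearest level. The function names, the fallback to the argmax when the set is empty, and the toy 5-level input are all hypothetical choices made for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal quantile of the scores 1 - p(true level).

    Assumption: this simple score is used; the paper may use another one.
    """
    scores = sorted(1.0 - p[y] for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if rank > n:
        # Too few calibration points for this alpha: keep every level.
        return 1.0
    return scores[rank - 1]


def prediction_set(probs, qhat):
    """Levels whose score 1 - p does not exceed the calibrated threshold."""
    keep = [k for k, p in enumerate(probs) if 1.0 - p <= qhat]
    # Hypothetical fallback: use the argmax so the set is never empty.
    return keep or [max(range(len(probs)), key=probs.__getitem__)]


def decode(probs, qhat):
    """Probability-weighted mean level over the conformal set, rounded half up."""
    levels = prediction_set(probs, qhat)
    mass = sum(probs[k] for k in levels)  # renormalization constant
    expected = sum(k * probs[k] / mass for k in levels)
    return math.floor(expected + 0.5)


if __name__ == "__main__":
    # Toy 5-level distribution: the argmax is level 1, but most of the
    # retained probability mass sits on levels 3 and 4.
    probs = [0.0, 0.35, 0.05, 0.3, 0.3]
    print(prediction_set(probs, qhat=0.75), decode(probs, qhat=0.75))
```

In the toy input the single most likely level is far from where most of the retained mass lies, so averaging over the set moves the decoded level toward that mass. This is the kind of shift that can lower the quadratic penalty in QWK when the argmax would otherwise be a distant miss.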

Related papers

Conformal Prediction Sets for Next-Token Prediction in Large Language Models: Balancing Coverage Guarantees with Set Efficiency [0.0]
We present a systematic study of Adaptive Prediction Sets (APS) applied to next-token prediction in transformer-based models with large vocabularies. We propose Vocabulary-Aware Conformal Prediction (VACP) to reduce the effective prediction space while provably maintaining marginal coverage.
arXiv Detail & Related papers (2025-12-27T19:08:54Z)
Beyond the Score: Uncertainty-Calibrated LLMs for Automated Essay Assessment [11.525382140783043]
This work combines conformal prediction and UAcc for essay scoring. Reliability is assessed with UAcc, an uncertainty-aware accuracy that rewards models for being both correct and concise. Open-source, mid-sized LLMs can already support teacher-in-the-loop AES.
arXiv Detail & Related papers (2025-09-19T12:28:50Z)
!MSA at BAREC Shared Task 2025: Ensembling Arabic Transformers for Readability Assessment [0.0]
We present MSA's winning system for the BAREC 2025 Shared Task on fine-grained Arabic readability assessment. Our approach is a confidence-weighted ensemble of four complementary transformer models. The system reached 87.5 percent QWK at the sentence level and 87.4 percent at the document level.
arXiv Detail & Related papers (2025-09-12T08:08:45Z)
COIN: Uncertainty-Guarding Selective Question Answering for Foundation Models with Provable Risk Guarantees [51.5976496056012]
COIN is an uncertainty-guarding selection framework that calibrates statistically valid thresholds to filter a single generated answer per question. COIN estimates the empirical error rate on a calibration set and applies confidence interval methods to establish a high-probability upper bound on the true error rate. We demonstrate COIN's robustness in risk control, strong test-time power in retaining admissible answers, and predictive efficiency under limited calibration data.
arXiv Detail & Related papers (2025-06-25T07:04:49Z)
Conformal Prediction Sets with Improved Conditional Coverage using Trust Scores [52.92618442300405]
It is impossible to achieve exact, distribution-free conditional coverage in finite samples. We propose an alternative conformal prediction algorithm that targets coverage where it matters most.
arXiv Detail & Related papers (2025-01-17T12:01:56Z)
RICA2: Rubric-Informed, Calibrated Assessment of Actions [8.641411594566714]
We present RICA2 - a deep probabilistic model that integrates a scoring rubric and accounts for prediction uncertainty for action quality assessment (AQA). We demonstrate that our method establishes a new state of the art on public benchmarks, including FineDiving, MTL-AQA, and JIGSAWS, with superior performance in score prediction and uncertainty calibration.
arXiv Detail & Related papers (2024-08-04T20:35:33Z)
TCE at Qur'an QA 2023 Shared Task: Low Resource Enhanced Transformer-based Ensemble Approach for Qur'anic QA [0.0]
We present our approach to tackle Qur'an QA 2023 shared tasks A and B. To address the challenge of low-resourced training data, we rely on transfer learning together with a voting ensemble. We employ different architectures and learning mechanisms for a range of Arabic pre-trained transformer-based models for both tasks.
arXiv Detail & Related papers (2024-01-23T19:32:54Z)
Weak Supervision Performance Evaluation via Partial Identification [46.73061437177238]
Programmatic Weak Supervision (PWS) enables supervised model training without direct access to ground truth labels. We present a novel method to address this challenge by framing model evaluation as a partial identification problem. Our approach derives reliable bounds on key metrics without requiring labeled data, overcoming core limitations in current weak supervision evaluation techniques.
arXiv Detail & Related papers (2023-12-07T07:15:11Z)
Equal Opportunity of Coverage in Fair Regression [50.76908018786335]
We study fair machine learning (ML) under predictive uncertainty to enable reliable and trustworthy decision-making. We propose Equal Opportunity of Coverage (EOC) that aims to achieve two properties: (1) coverage rates for different groups with similar outcomes are close, and (2) the coverage rate for the entire population remains at a predetermined level.
arXiv Detail & Related papers (2023-11-03T21:19:59Z)
Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees [63.62448343531963]
We propose a combination of the existing paradigms, intelligently sampling responses to be scored by humans. We observe significant gains in accuracy (19.80% increase on average) and quadratic weighted kappa (QWK) (25.60% on average) with a relatively small human budget.
arXiv Detail & Related papers (2021-11-17T05:00:51Z)
Distribution-free uncertainty quantification for classification under label shift [105.27463615756733]
We focus on uncertainty quantification (UQ) for classification problems via two avenues. We first argue that label shift hurts UQ, by showing degradation in coverage and calibration. We examine these techniques theoretically in a distribution-free framework and demonstrate their excellent practical performance.
arXiv Detail & Related papers (2021-03-04T20:51:03Z)
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks. We introduce a new scoring method that casts a plausibility ranking task in a full-text format. We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.