Aligned Textual Scoring Rules
- URL: http://arxiv.org/abs/2507.06221v1
- Date: Tue, 08 Jul 2025 17:53:22 GMT
- Title: Aligned Textual Scoring Rules
- Authors: Yuxuan Lu, Yifan Wu, Jason Hartline, Michael J. Curry,
- Abstract summary: A scoring rule is proper if, from the agent's perspective, reporting the true belief maximizes the expected score. Our paper designs the Aligned Scoring Rule (ASR) for text by minimizing the mean squared error between a proper scoring rule and a reference score.
- Score: 14.705645899416117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scoring rules elicit probabilistic predictions from a strategic agent by scoring the prediction against a ground truth state. A scoring rule is proper if, from the agent's perspective, reporting the true belief maximizes the expected score. With the development of language models, Wu and Hartline (2024) propose a reduction from textual information elicitation to the numerical (i.e., probabilistic) information elicitation problem, which achieves provable properness for textual elicitation. However, not all proper scoring rules are well aligned with human preferences over text. Our paper designs the Aligned Scoring Rule (ASR) for text by minimizing the mean squared error between a proper scoring rule and a reference score (e.g., a human score). Our experiments show that ASR outperforms previous methods in aligning with human preferences while maintaining properness.
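Properness can be checked numerically: under a proper rule, the agent's expected score is maximized exactly at the true belief. The snippet below is an illustrative sketch using the quadratic (Brier) rule for a binary state, not the paper's implementation:

```python
import numpy as np

def brier_score(report, outcome):
    """Quadratic (Brier) scoring rule for a binary state (higher is better)."""
    return 1.0 - (outcome - report) ** 2

def expected_score(report, belief):
    """Agent's expected score when their true belief is `belief`."""
    return belief * brier_score(report, 1) + (1 - belief) * brier_score(report, 0)

belief = 0.7
reports = np.linspace(0, 1, 101)
scores = [expected_score(r, belief) for r in reports]
best = reports[int(np.argmax(scores))]
# Properness: the truthful report maximizes the agent's expected score.
assert abs(best - belief) < 1e-6
```

Convex combinations of proper scoring rules remain proper, which is what makes an alignment objective such as minimizing MSE against a human reference score tractable without sacrificing properness.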
Related papers
- PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z) - Group-Adaptive Threshold Optimization for Robust AI-Generated Text Detection [60.09665704993751]
We introduce FairOPT, an algorithm for group-specific threshold optimization for probabilistic AI-text detectors. Our framework paves the way for more robust classification in AI-generated content detection via post-processing.
arXiv Detail & Related papers (2025-02-06T21:58:48Z) - Reducing Biases in Record Matching Through Scores Calibration [1.5530839016602822]
We propose a threshold-independent framework for measuring and reducing score bias. We show that several state-of-the-art matching methods exhibit substantial score bias, even when appearing fair under standard threshold-based metrics. We introduce two post-processing score calibration algorithms. The first, calib, aligns group-wise score distributions using the Wasserstein barycenter, targeting demographic parity. The second, ccalib, conditions on predicted labels to further reduce label-dependent biases, such as equal opportunity.
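In one dimension the Wasserstein barycenter has a closed form: its quantile function is the average of the groups' quantile functions. The sketch below illustrates that style of group-wise score calibration; it is a simplification for intuition, not the paper's exact calib algorithm:

```python
import numpy as np

def calibrate_to_barycenter(scores_by_group):
    """Map each group's scores onto the 1-D Wasserstein barycenter.

    In 1-D the barycenter's quantile function is the average of the
    groups' empirical quantile functions, so we push each score through
    its within-group rank onto that averaged quantile curve.
    """
    qs = np.linspace(0, 1, 101)
    quantiles = {g: np.quantile(s, qs) for g, s in scores_by_group.items()}
    barycenter_q = np.mean(list(quantiles.values()), axis=0)
    calibrated = {}
    for g, s in scores_by_group.items():
        ranks = np.searchsorted(np.sort(s), s, side="right") / len(s)
        calibrated[g] = np.interp(ranks, qs, barycenter_q)
    return calibrated

rng = np.random.default_rng(0)
groups = {"A": rng.normal(0.6, 0.1, 500), "B": rng.normal(0.4, 0.1, 500)}
cal = calibrate_to_barycenter(groups)
# After calibration the group score distributions coincide (demographic parity).
assert abs(cal["A"].mean() - cal["B"].mean()) < 1e-6
```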
arXiv Detail & Related papers (2024-11-03T21:01:40Z) - A Best-of-Both Approach to Improve Match Predictions and Reciprocal Recommendations for Job Search [15.585641615174623]
This paper introduces and demonstrates a novel and practical solution to improve reciprocal recommendations in production by leveraging pseudo-match scores.
Specifically, our approach generates dense and more directly relevant pseudo-match scores by combining the true match labels with relatively inaccurate but dense match predictions.
Our method can be seen as a best-of-both (BoB) approach, as it combines the high-level ideas of both direct match prediction and the two separate models approach.
arXiv Detail & Related papers (2024-09-17T08:51:02Z) - Language Generation with Strictly Proper Scoring Rules [70.340673452404]
We propose a strategy for adapting scoring rules to language generation, allowing for language modeling with any non-local scoring rules.
We train language generation models using two classic strictly proper scoring rules, the Brier score and the Spherical score, as alternatives to the logarithmic score.
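For intuition, the three rules for a categorical distribution can be compared directly: all are strictly proper, but only the logarithmic score is local (it depends only on the probability of the realized outcome). A minimal numpy sketch, illustrative rather than the paper's training code:

```python
import numpy as np

def log_score(p, y):
    """Logarithmic rule: local, depends only on p[y]."""
    return np.log(p[y])

def brier_score(p, y):
    """Quadratic (Brier) rule: 2*p[y] - ||p||^2, non-local."""
    return 2 * p[y] - np.sum(p ** 2)

def spherical_score(p, y):
    """Spherical rule: p[y] / ||p||_2, non-local."""
    return p[y] / np.linalg.norm(p)

q = np.array([0.5, 0.3, 0.2])  # "true" next-token distribution

def expected(rule, p):
    """Expected score of reporting p when outcomes are drawn from q."""
    return sum(q[y] * rule(p, y) for y in range(len(q)))

# Strict properness: the truthful report p = q beats any other report.
rng = np.random.default_rng(1)
for rule in (log_score, brier_score, spherical_score):
    truthful = expected(rule, q)
    for _ in range(200):
        p = rng.dirichlet(np.ones(3))
        assert expected(rule, p) <= truthful + 1e-12
```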
arXiv Detail & Related papers (2024-05-29T09:09:00Z) - Examining marginal properness in the external validation of survival models with squared and logarithmic losses [0.0]
We survey common squared and logarithmic scoring rules for survival analysis. We show that both the Integrated Survival Brier Score (ISBS) and the Right-Censored Log-Likelihood (RCLL) are theoretically improper. We advocate for both the RCLL and ISBS in external validation of models, including in automated procedures.
arXiv Detail & Related papers (2022-12-10T10:34:35Z) - Optimizing Partial Area Under the Top-k Curve: Theory and Practice [151.5072746015253]
We develop a novel metric named partial Area Under the top-k Curve (AUTKC)
AUTKC has a better discrimination ability, and its Bayes optimal score function could give a correct top-K ranking with respect to the conditional probability.
We present an empirical surrogate risk minimization framework to optimize the proposed metric.
arXiv Detail & Related papers (2022-09-03T11:09:13Z) - Optimal Scoring Rule Design under Partial Knowledge [9.759870160862205]
We study optimal scoring rules when the principal has partial knowledge of an agent's signal distribution.
In our setting, the principal only knows about a set of distributions where the agent's signal distribution belongs.
We propose an efficient algorithm to compute an optimal scoring rule when the set of distributions is finite.
arXiv Detail & Related papers (2021-07-15T16:05:48Z) - Post-Contextual-Bandit Inference [57.88785630755165]
Contextual bandit algorithms are increasingly replacing non-adaptive A/B tests in e-commerce, healthcare, and policymaking.
They can both improve outcomes for study participants and increase the chance of identifying good or even best policies.
To support credible inference on novel interventions at the end of the study, we still want to construct valid confidence intervals on average treatment effects, subgroup effects, or value of new policies.
arXiv Detail & Related papers (2021-06-01T12:01:51Z) - Rationalizing Text Matching: Learning Sparse Alignments via Optimal Transport [14.86310501896212]
In this work, we extend this selective rationalization approach to text matching.
The goal is to jointly select and align text pieces, such as tokens or sentences, as a justification for the downstream prediction.
Our approach employs optimal transport (OT) to find a minimal cost alignment between the inputs.
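A minimal-cost alignment of this kind can be computed with entropy-regularized OT (Sinkhorn iterations). The sketch below is illustrative only; the paper's exact OT variant may differ:

```python
import numpy as np

def sinkhorn_alignment(cost, reg=0.05, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: (n, m) matrix of pairwise costs between text pieces of the two
    inputs. Returns a soft alignment plan with uniform marginals.
    """
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m   # uniform marginals
    K = np.exp(-cost / reg)                  # Gibbs kernel
    v = np.ones(m)
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

# Toy example: similar pieces (low cost on the diagonal) receive the mass.
cost = np.array([[0.1, 1.0, 1.0],
                 [1.0, 0.1, 1.0],
                 [1.0, 1.0, 0.1]])
P = sinkhorn_alignment(cost)
assert np.allclose(P.sum(axis=0), 1/3, atol=1e-9)     # column marginals
assert (np.argmax(P, axis=1) == np.arange(3)).all()   # diagonal alignment
```

With small regularization the plan concentrates on the cheapest pairings, which is what yields a sparse, interpretable alignment between tokens or sentences.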
arXiv Detail & Related papers (2020-05-27T01:20:49Z) - The Strong Screening Rule for SLOPE [5.156484100374058]
We develop a screening rule for SLOPE by examining its subdifferential and show that this rule is a generalization of the strong rule for the lasso.
Our numerical experiments show that the rule performs well in practice, leading to improvements by orders of magnitude for data in the $p \gg n$ domain.
arXiv Detail & Related papers (2020-05-07T20:14:20Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.