OrdRankBen: A Novel Ranking Benchmark for Ordinal Relevance in NLP
- URL: http://arxiv.org/abs/2503.00674v1
- Date: Sun, 02 Mar 2025 00:28:55 GMT
- Title: OrdRankBen: A Novel Ranking Benchmark for Ordinal Relevance in NLP
- Authors: Yan Wang, Lingfei Qian, Xueqing Peng, Jimin Huang, Dongji Feng,
- Abstract summary: Benchmark datasets play a crucial role in providing standardized testbeds that ensure fair comparisons. Existing NLP ranking benchmarks typically use binary relevance labels or continuous relevance scores, neglecting ordinal relevance scores. We introduce OrdRankBen, a novel benchmark designed to capture multi-granularity relevance distinctions.
- Score: 6.6002656593260225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The evaluation of ranking tasks remains a significant challenge in natural language processing (NLP), particularly due to the lack of direct labels for results in real-world scenarios. Benchmark datasets play a crucial role in providing standardized testbeds that ensure fair comparisons, enhance reproducibility, and enable progress tracking, facilitating rigorous assessment and continuous improvement of ranking models. Existing NLP ranking benchmarks typically use binary relevance labels or continuous relevance scores, neglecting ordinal relevance scores. However, binary labels oversimplify relevance distinctions, while continuous scores lack a clear ordinal structure, making it challenging to capture nuanced ranking differences effectively. To address these challenges, we introduce OrdRankBen, a novel benchmark designed to capture multi-granularity relevance distinctions. Unlike conventional benchmarks, OrdRankBen incorporates structured ordinal labels, enabling more precise ranking evaluations. Given the absence of suitable datasets for ordinal relevance ranking in NLP, we constructed two datasets with distinct ordinal label distributions. We further evaluate models of three types on these datasets: ranking-based language models, general large language models, and ranking-focused large language models. Experimental results show that ordinal relevance modeling provides a more precise evaluation of ranking models, improving their ability to distinguish multi-granularity differences among ranked items, which is crucial for tasks that demand fine-grained relevance differentiation.
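To make the contrast between label types concrete, below is a minimal sketch (not taken from the paper) of how graded, ordinal relevance labels change ranking evaluation relative to binary labels, using NDCG@k as the metric. The 0-3 label scale and the example rankings are hypothetical illustrations.

```python
# Minimal sketch: NDCG@k over ordinal relevance labels vs. binary labels.
# The 0-3 grade scale and the example runs are hypothetical, not from OrdRankBen.
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ordinal relevance grades."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the system ranking divided by the ideal (sorted) DCG."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Two system rankings of the same items, with ordinal grades 0-3 (3 = most relevant).
run_a = [3, 2, 1, 0, 0]   # places the highly relevant item first
run_b = [1, 2, 3, 0, 0]   # same items, but the highly relevant item is demoted

print(ndcg_at_k(run_a, 3))  # 1.0
print(ndcg_at_k(run_b, 3))  # ~0.68: ordinal grades penalize the demotion

# With binary labels (relevant vs. not), both runs collapse to [1, 1, 1, 0, 0]
# and score identically, hiding the difference the ordinal labels expose.
```

The gain function 2^rel - 1 weights higher ordinal grades more heavily, which is why the metric can separate rankings that binary labels treat as equivalent.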
Related papers
- A comparative analysis of rank aggregation methods for the partial label ranking problem [10.994154016400147]
The label ranking problem is a supervised learning scenario in which the learner predicts a total order of the class labels for a given input instance. This paper explores several alternative aggregation methods for this critical step, including scoring-based and probabilistic-based rank aggregation approaches.
arXiv Detail & Related papers (2025-02-24T11:44:43Z) - Learning when to rank: Estimation of partial rankings from sparse, noisy comparisons [0.0]
We develop a principled Bayesian methodology for learning partial rankings. Our framework is adaptable to any statistical ranking method. It gives a more parsimonious summary of the data than traditional ranking.
arXiv Detail & Related papers (2025-01-05T11:04:30Z) - Splitting criteria for ordinal decision trees: an experimental study [6.575723870852787]
Ordinal Classification (OC) is a machine learning field that addresses classification tasks where the labels exhibit a natural order. OC takes the ordinal relationship into account, producing more accurate and relevant results. This work conducts an experimental study of tree-based methodologies designed to capture ordinal relationships.
arXiv Detail & Related papers (2024-12-18T10:41:44Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - RankingSHAP -- Listwise Feature Attribution Explanations for Ranking Models [48.895510739010355]
We present three key contributions to address this gap.
First, we rigorously define listwise feature attribution for ranking models.
Second, we introduce RankingSHAP, extending the popular SHAP framework to accommodate listwise ranking attribution.
Third, we propose two novel evaluation paradigms for assessing the faithfulness of attributions in learning-to-rank models.
arXiv Detail & Related papers (2024-03-24T10:45:55Z) - Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment [54.179859639868646]
We propose a model agnostic post-processing framework xOrder for achieving fairness in bipartite ranking.
xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics.
We evaluate our proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories.
arXiv Detail & Related papers (2023-07-27T07:42:44Z) - Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Statistical Comparisons of Classifiers by Generalized Stochastic Dominance [0.0]
There is still no consensus on how to compare classifiers over multiple data sets with respect to several criteria.
In this paper, we add a fresh view to the vivid debate by adopting recent developments in decision theory.
We show that our framework ranks classifiers by a generalized concept of dominance, which powerfully circumvents the cumbersome, and often even self-contradictory, reliance on aggregates.
arXiv Detail & Related papers (2022-09-05T09:28:15Z) - Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve the Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z) - Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words".
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)