Abnormal-aware Multi-person Evaluation System with Improved Fuzzy
Weighting
- URL: http://arxiv.org/abs/2205.00388v1
- Date: Sun, 1 May 2022 03:42:43 GMT
- Title: Abnormal-aware Multi-person Evaluation System with Improved Fuzzy
Weighting
- Authors: Shutong Ni
- Abstract summary: We choose a two-stage screening method, which consists of rough screening and a score-weighted Kendall-$\tau$ distance.
We use the Fuzzy Synthetic Evaluation (FSE) method to determine the significance of scores given by reviewers as well as their reliability.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subjectivity pervades everyday evaluation processes. Our research
concentrates on a multi-person evaluation system with anomaly detection to
minimize the inaccuracy that subjective assessment introduces. We adopt a
two-stage screening method, consisting of rough screening followed by a
score-weighted Kendall-$\tau$ distance, to winnow out abnormal data, coupled
with hypothesis testing to narrow the global discrepancy. We then use the
Fuzzy Synthetic Evaluation (FSE) method to determine the significance and
reliability of the scores given by reviewers, culminating in a more impartial
weight for each reviewer in the final conclusion. The results yield a clear
and comprehensive ranking rather than one-sided scores, together with
efficient filtering of abnormal data and a reasonably objective
weight-determination mechanism. Our study suggests that a multi-person
evaluation system can be adapted to attain both equity and a healthier
competitive atmosphere.
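
As a rough illustration of the pipeline described in the abstract, the sketch below implements a score-weighted Kendall-$\tau$ distance, a simple outlier screen, and a consensus-based reviewer weighting in Python. The pair-gap weighting, the z-score cut-off (standing in for the paper's hypothesis test), and the deviation-based weights (standing in for the full FSE procedure) are illustrative assumptions rather than the paper's exact formulation, and function names such as `weighted_kendall_tau_distance` are hypothetical.

```python
import numpy as np

def weighted_kendall_tau_distance(scores_a, scores_b):
    """Score-weighted Kendall-tau distance between two reviewers (sketch).

    Every discordant candidate pair (ordered oppositely by the two
    reviewers) contributes the average absolute score gap, so strongly
    contested swaps count more than near-ties.  This pair-gap rule is an
    assumption; the paper's exact weighting may differ.
    """
    n = len(scores_a)
    dist = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            order_a = np.sign(scores_a[i] - scores_a[j])
            order_b = np.sign(scores_b[i] - scores_b[j])
            if order_a * order_b < 0:  # discordant pair
                dist += 0.5 * (abs(scores_a[i] - scores_a[j])
                               + abs(scores_b[i] - scores_b[j]))
    return dist

def screen_reviewers(score_matrix, z_threshold=2.0):
    """Rough screening sketch: flag reviewers whose mean distance to the
    rest of the panel is an outlier.  A z-score cut-off stands in for the
    paper's hypothesis test."""
    m = score_matrix.shape[0]
    mean_dist = np.array([
        np.mean([weighted_kendall_tau_distance(score_matrix[a], score_matrix[b])
                 for b in range(m) if b != a])
        for a in range(m)
    ])
    z = (mean_dist - mean_dist.mean()) / (mean_dist.std() + 1e-12)
    return z > z_threshold          # True marks a likely abnormal reviewer

def reviewer_weights(score_matrix, abnormal):
    """Illustrative stand-in for the FSE weighting: retained reviewers get
    weights that shrink with their deviation from the panel consensus,
    normalised to sum to one; flagged reviewers get zero weight."""
    consensus = score_matrix[~abnormal].mean(axis=0)
    dev = np.abs(score_matrix - consensus).mean(axis=1)
    raw = np.where(abnormal, 0.0, 1.0 / (1.0 + dev))
    return raw / raw.sum()

# Toy panel: 4 reviewers scoring 5 candidates; reviewer 3 is erratic.
# A lower z-threshold is used because the panel is tiny.
scores = np.array([
    [88, 76, 92, 65, 80],
    [85, 74, 90, 68, 79],
    [87, 78, 91, 63, 82],
    [60, 95, 55, 90, 50],
])
abnormal = screen_reviewers(scores, z_threshold=1.5)
weights = reviewer_weights(scores, abnormal)
final_scores = weights @ scores     # weighted aggregate used for ranking
print(abnormal, weights.round(3), final_scores.round(1))
```

On this toy panel, the erratic fourth reviewer accumulates a large mean distance to the others, is excluded from the consensus, and receives zero weight before the final aggregation.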
Related papers
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z) - Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling [50.08315607506652]
We propose a Constrained Active Sampling Framework (CASF) for reliable human judgment.
Experimental results show that CASF achieves 93.18% top-ranked system recognition accuracy.
arXiv Detail & Related papers (2024-06-12T07:44:36Z) - On Pixel-level Performance Assessment in Anomaly Detection [87.7131059062292]
Anomaly detection methods have demonstrated remarkable success across various applications.
However, assessing their performance, particularly at the pixel-level, presents a complex challenge.
In this paper, we dissect the intricacies of this challenge, underscored by visual evidence and statistical analysis.
arXiv Detail & Related papers (2023-10-25T08:02:27Z) - Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs).
Our findings reveal a concerning bias in the evaluation process: answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z) - Score-balanced Loss for Multi-aspect Pronunciation Assessment [3.6825890616838066]
We propose a novel loss function, score-balanced loss, to address the problem caused by uneven data.
As a re-weighting approach, we assign higher costs when the predicted score is of the minority class.
We evaluate our method on the speechocean762 dataset, which has noticeably imbalanced scores for several aspects.
arXiv Detail & Related papers (2023-05-26T06:21:37Z) - GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models.
We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench.
GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z) - A Call to Reflect on Evaluation Practices for Failure Detection in Image
Classification [0.491574468325115]
We present a large-scale empirical study that, for the first time, enables the benchmarking of confidence scoring functions.
The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation.
arXiv Detail & Related papers (2022-11-28T12:25:27Z) - Uncertainty-Driven Action Quality Assessment [67.20617610820857]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores.
We generate the estimation of uncertainty for each prediction, which is employed to re-weight AQA regression loss.
Our proposed method achieves competitive results on three benchmarks: the Olympic-event datasets MTL-AQA and FineDiving, and the surgical-skill dataset JIGSAWS.
arXiv Detail & Related papers (2022-07-29T07:21:15Z) - Estimating and Improving Fairness with Adversarial Learning [65.99330614802388]
We propose an adversarial multi-task training strategy to simultaneously mitigate and detect bias in the deep learning-based medical image analysis system.
Specifically, we propose to add a discrimination module against bias and a critical module that predicts unfairness within the base classification model.
We evaluate our framework on a large-scale, publicly available skin lesion dataset.
arXiv Detail & Related papers (2021-03-07T03:10:32Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking
Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.