A Framework for Cluster and Classifier Evaluation in the Absence of
Reference Labels
- URL: http://arxiv.org/abs/2109.11126v1
- Date: Thu, 23 Sep 2021 03:42:01 GMT
- Title: A Framework for Cluster and Classifier Evaluation in the Absence of
Reference Labels
- Authors: Robert J. Joyce, Edward Raff, Charles Nicholas
- Abstract summary: We propose a supplement to using reference labels, which we call an approximate ground truth refinement (AGTR)
We prove that bounds on specific metrics used to evaluate clustering algorithms can be computed without reference labels.
We also introduce a procedure that uses an AGTR to identify inaccurate evaluation results produced from datasets of dubious quality.
- Score: 23.658440146240025
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In some problem spaces, the high cost of obtaining ground truth labels
necessitates use of lower quality reference datasets. It is difficult to
benchmark model performance using these datasets, as evaluation results may be
biased. We propose a supplement to using reference labels, which we call an
approximate ground truth refinement (AGTR). Using an AGTR, we prove that bounds
on specific metrics used to evaluate clustering algorithms and multi-class
classifiers can be computed without reference labels. We also introduce a
procedure that uses an AGTR to identify inaccurate evaluation results produced
from datasets of dubious quality. Creating an AGTR requires domain knowledge,
and malware family classification is a task with robust domain knowledge
approaches that support the construction of an AGTR. We demonstrate our AGTR
evaluation framework by applying it to a popular malware labeling tool to
diagnose over-fitting in prior testing and evaluate changes whose impact could
not be meaningfully quantified under previous data.
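To make the refinement idea concrete: if every AGTR cluster is, by construction, a subset of some (unknown) ground-truth cluster, then any pair of samples co-clustered by the AGTR is a guaranteed true-positive pair. The following Python sketch uses that property to lower-bound the pairwise precision of a predicted clustering; it is an illustrative bound in the spirit of the paper, not a reproduction of its exact theorems, and all function names are ours.
```python
from itertools import combinations

def same_cluster_pairs(assignments):
    """Return the set of unordered index pairs placed in the same cluster.

    `assignments` maps sample index -> cluster id.
    """
    by_cluster = {}
    for idx, cid in assignments.items():
        by_cluster.setdefault(cid, []).append(idx)
    pairs = set()
    for members in by_cluster.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def precision_lower_bound(predicted, agtr):
    """Lower-bound pairwise precision of `predicted` using an AGTR.

    Every AGTR cluster is assumed to be a subset of a true cluster, so
    any pair co-clustered in the AGTR is a guaranteed true positive;
    pairs co-clustered in both the prediction and the AGTR therefore
    give a floor on precision.
    """
    pred_pairs = same_cluster_pairs(predicted)
    if not pred_pairs:
        return 1.0  # no positive pairs predicted: vacuously precise
    agtr_pairs = same_cluster_pairs(agtr)
    return len(pred_pairs & agtr_pairs) / len(pred_pairs)

# Toy example: the prediction merges two AGTR clusters.
predicted = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
agtr = {0: "x", 1: "x", 2: "y", 3: "z", 4: "z"}
print(precision_lower_bound(predicted, agtr))  # 2/4 = 0.5
```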
Related papers
- Retrieving Semantically Similar Decisions under Noisy Institutional Labels: Robust Comparison of Embedding Methods [0.0]
General-purpose embedder (OpenAI) outperforms a domain-specific BERT trained from scratch on 30,000 decisions.
Our framework is robust enough to be used for evaluation under a noisy gold dataset.
arXiv Detail & Related papers (2025-12-05T12:54:26Z)
- Multi-Agent VLMs Guided Self-Training with PNU Loss for Low-Resource Offensive Content Detection [32.68131638705225]
We propose a self-training framework that leverages abundant unlabeled data through collaborative pseudo-labeling.
Our method iteratively assigns pseudo-labels to unlabeled instances with the support of Multi-Agent Vision-Language Models.
Experiments on benchmark datasets demonstrate that our framework substantially outperforms baselines under limited supervision.
arXiv Detail & Related papers (2025-11-14T08:03:35Z)
- Quantifying Query Fairness Under Unawareness [82.33181164973365]
We introduce a robust fairness estimator based on quantification that effectively handles multiple sensitive attributes beyond binary classifications.
Our method outperforms existing baselines across various sensitive attributes and is the first to establish a reliable protocol for measuring fairness under unawareness.
arXiv Detail & Related papers (2025-06-04T16:31:44Z)
- When VLMs Meet Image Classification: Test Sets Renovation via Missing Label Identification [11.49089004019603]
We present a comprehensive framework named REVEAL to address both noisy labels and missing labels in image classification test sets.
REVEAL detects potential noisy labels and omissions, aggregates predictions from various methods, and refines label accuracy through confidence-informed predictions and consensus-based filtering.
Our method effectively reveals missing labels from public datasets and provides soft-labeled results with likelihoods.
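A minimal sketch of what such consensus-based filtering could look like, assuming per-model predictions and confidences are already available; this is our illustrative reading, not REVEAL's published algorithm, and all names are hypothetical.
```python
import numpy as np

def flag_suspect_labels(pred_labels, pred_conf, given_labels,
                        agree_frac=0.8, min_conf=0.9):
    """Flag samples where a confident majority of models disagrees with
    the shipped test-set label (candidates for noisy/missing labels).

    pred_labels: (n_models, n_samples) integer predictions per model.
    pred_conf:   (n_models, n_samples) confidences in [0, 1].
    given_labels: (n_samples,) labels shipped with the test set.
    """
    suspects = []
    for j in range(pred_labels.shape[1]):
        confident = pred_conf[:, j] >= min_conf
        if not confident.any():
            continue
        vals, counts = np.unique(pred_labels[confident, j],
                                 return_counts=True)
        top = vals[counts.argmax()]
        # A strong, confident consensus that contradicts the given label.
        if counts.max() / confident.sum() >= agree_frac \
                and top != given_labels[j]:
            suspects.append((j, int(top)))
    return suspects
```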
arXiv Detail & Related papers (2025-05-22T02:47:36Z)
- SEOE: A Scalable and Reliable Semantic Evaluation Framework for Open Domain Event Detection [70.23196257213829]
We propose a scalable and reliable Semantic-level Evaluation framework for Open domain Event detection.
Our proposed framework first constructs a scalable evaluation benchmark that currently includes 564 event types covering 7 major domains.
We then leverage large language models (LLMs) as automatic evaluation agents to compute a semantic F1-score, incorporating fine-grained definitions of semantically similar labels.
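As a sketch of how a semantic F1-score can relax exact label matching, the snippet below greedily matches predictions to gold types through a pluggable `is_match` judge; the LLM-based judge itself is abstracted behind that callable, and the greedy matching rule is our simplification, not necessarily SEOE's.
```python
def semantic_f1(predicted, gold, is_match):
    """F1 with a pluggable equivalence judge instead of exact matching.

    `is_match(p, g)` decides whether predicted type p and gold type g
    are semantically the same; exact matching is the special case
    is_match = lambda p, g: p == g.
    """
    unmatched_gold = list(gold)
    tp = 0
    for p in predicted:
        for g in unmatched_gold:
            if is_match(p, g):
                unmatched_gold.remove(g)  # one-to-one greedy matching
                tp += 1
                break
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```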
arXiv Detail & Related papers (2025-03-05T09:37:05Z)
- Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes [22.45812577928658]
We introduce a new framework for analyzing classification datasets based on the ratios of reconstruction errors between autoencoders trained on individual classes.
This analysis framework enables efficient characterization of datasets on the sample, class, and entire dataset levels.
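Assuming the per-class reconstruction errors have already been computed, one way to turn them into label-mistake scores is the ratio below; this is an illustrative reading of the ratio idea, not the paper's exact formula.
```python
import numpy as np

def label_mistake_scores(errors, labels):
    """Score each sample by its reconstruction-error ratio.

    errors: (n_samples, n_classes) array; errors[i, c] is the
            reconstruction error of sample i under the autoencoder
            trained only on class c.
    labels: (n_samples,) assigned class of each sample.
    A ratio well above 1 means some other class's autoencoder models
    the sample better than its own class's, flagging a possible mistake.
    """
    errors = np.asarray(errors, dtype=float)
    n = errors.shape[0]
    own = errors[np.arange(n), labels]
    others = errors.copy()
    others[np.arange(n), labels] = np.inf  # exclude the assigned class
    best_other = others.min(axis=1)
    return own / np.maximum(best_other, 1e-12)
```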
arXiv Detail & Related papers (2024-12-03T17:29:00Z)
- Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards [5.632231145349045]
This paper investigates transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP.
Existing relation extraction benchmarks often suffer from insufficient documentation and lack crucial details.
While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well.
arXiv Detail & Related papers (2024-11-07T22:36:19Z)
- DREW: Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW).
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
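A sketch of the second stage described above, taking the watermark decoding as given: once the error-controlled key has been decoded to a cluster id, similarity search runs only inside that cluster. The decoding step is out of scope here, and all names are assumptions for illustration.
```python
import numpy as np

def retrieve_within_cluster(query_vec, cluster_id, embeddings, cluster_ids):
    """Cosine-similarity search restricted to one cluster.

    `cluster_id` is assumed to come from decoding the error-controlled
    watermark key in the first stage; `embeddings` and `cluster_ids`
    describe the reference dataset.
    """
    members = np.where(cluster_ids == cluster_id)[0]
    if members.size == 0:
        raise ValueError("decoded cluster is empty")
    sub = embeddings[members]
    sims = sub @ query_vec / (
        np.linalg.norm(sub, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    return int(members[np.argmax(sims)])  # index of the best match
```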
arXiv Detail & Related papers (2024-06-05T01:19:44Z)
- Evaluating Retrieval Quality in Retrieval-Augmented Generation [21.115495457454365]
Traditional end-to-end evaluation methods are computationally expensive.
We propose eRAG, where each document in the retrieval list is individually utilized by the large language model within the RAG system.
eRAG offers significant computational advantages, improving runtime and consuming up to 50 times less GPU memory than end-to-end evaluation.
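In outline, the per-document scoring could look like the sketch below, where `generate` and `downstream_metric` are hypothetical callables standing in for the RAG system's LLM and the task's end metric; aggregating the resulting document scores into set- or rank-based retrieval metrics is left to standard tooling.
```python
def erag_style_scores(query, retrieved_docs, reference_answer,
                      generate, downstream_metric):
    """Score each retrieved document by feeding it to the LLM alone.

    `generate(query, doc)` produces the LLM output given a single
    document; `downstream_metric(output, reference)` scores that output
    against the task's reference answer. The per-document scores act as
    document-level relevance labels for retrieval evaluation.
    """
    return [downstream_metric(generate(query, doc), reference_answer)
            for doc in retrieved_docs]
```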
arXiv Detail & Related papers (2024-04-21T21:22:28Z)
- Drawing the Same Bounding Box Twice? Coping Noisy Annotations in Object Detection with Repeated Labels [6.872072177648135]
We propose a novel localization algorithm that adapts well-established ground truth estimation methods.
Our algorithm also shows superior performance during training on the TexBiG dataset.
arXiv Detail & Related papers (2023-09-18T13:08:44Z)
- Parametric Classification for Generalized Category Discovery: A Baseline Study [70.73212959385387]
Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples.
We investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem.
We propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers.
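A common form of such entropy regularisation, and a plausible reading of the design choice here, is to maximise the entropy of the batch-averaged prediction so the classifier cannot collapse onto a few (known) classes; the PyTorch sketch below shows that assumed form, not necessarily the paper's exact loss.
```python
import torch
import torch.nn.functional as F

def mean_entropy_regulariser(logits):
    """Negative entropy of the batch-averaged prediction.

    logits: (batch, n_classes) raw classifier outputs.
    Adding this term to the loss maximises the entropy of the mean
    prediction, discouraging collapse onto a few classes.
    """
    mean_probs = F.softmax(logits, dim=1).mean(dim=0)
    entropy = -(mean_probs * torch.log(mean_probs + 1e-8)).sum()
    return -entropy  # minimising this maximises the entropy
```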
arXiv Detail & Related papers (2022-11-21T18:47:11Z)
- CEREAL: Few-Sample Clustering Evaluation [4.569028973407756]
We focus on the underexplored problem of estimating clustering quality with limited labels.
We introduce CEREAL, a comprehensive framework for few-sample clustering evaluation.
Our results show that CEREAL reduces the area under the absolute error curve by up to 57% compared to the best sampling baseline.
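For context, the naive baseline that few-sample evaluation methods are measured against is simply computing the metric on a small labeled subsample; the sketch below shows that baseline (not CEREAL itself), with `oracle_label` standing in for a human annotator.
```python
import random
from sklearn.metrics import normalized_mutual_info_score

def subsample_metric_estimate(cluster_ids, oracle_label, n_labels=50,
                              seed=0):
    """Estimate clustering quality from a small labeled subsample.

    `oracle_label(i)` is a stand-in for a human annotator returning the
    true label of sample i; it is called only `n_labels` times.
    """
    rng = random.Random(seed)
    idx = rng.sample(range(len(cluster_ids)),
                     min(n_labels, len(cluster_ids)))
    true = [oracle_label(i) for i in idx]
    pred = [cluster_ids[i] for i in idx]
    return normalized_mutual_info_score(true, pred)
```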
arXiv Detail & Related papers (2022-09-30T19:52:41Z)
- Semi-Supervised Cascaded Clustering for Classification of Noisy Label Data [0.3441021278275805]
The performance of supervised classification techniques often deteriorates when the data has noisy labels.
Most approaches addressing noisy label data rely on deep neural networks (DNNs), which require huge datasets for classification tasks.
We propose a semi-supervised cascaded clustering algorithm to extract patterns and generate a cascaded tree of classes in such datasets.
arXiv Detail & Related papers (2022-05-04T17:42:22Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms the state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision points.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
- Rethinking Pseudo Labels for Semi-Supervised Object Detection [84.697097472401]
We introduce certainty-aware pseudo labels tailored for object detection.
We dynamically adjust the thresholds used to generate pseudo labels and reweight loss functions for each category to alleviate the class imbalance problem.
Our approach improves supervised baselines by up to 10% AP using only 1-10% labeled data from COCO.
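One simple way to realise per-category dynamic thresholds, offered here only as an illustration rather than the paper's actual rule, is to scale each class's confidence threshold by how often that class appears among the current pseudo-labels, so rare classes are not starved:
```python
import numpy as np

def dynamic_class_thresholds(pseudo_labels, n_classes,
                             base=0.9, floor=0.5):
    """Per-class confidence thresholds scaled by class frequency.

    Classes that are rare among the current pseudo-labels get a lower
    threshold (down to `floor`), so the pseudo-label pool does not
    collapse onto frequent classes.
    """
    counts = np.bincount(pseudo_labels, minlength=n_classes).astype(float)
    freq = counts / max(counts.sum(), 1.0)
    thresholds = base * freq / max(freq.max(), 1e-8)
    return np.clip(thresholds, floor, base)
```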
arXiv Detail & Related papers (2021-06-01T01:32:03Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning typically assumes the incoming data are fully labeled, which may not hold in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
- Exploiting Sample Uncertainty for Domain Adaptive Person Re-Identification [137.9939571408506]
We estimate and exploit the credibility of the assigned pseudo-label of each sample to alleviate the influence of noisy labels.
Our uncertainty-guided optimization brings significant improvement and achieves the state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2020-12-16T04:09:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.