WRENCH: A Comprehensive Benchmark for Weak Supervision
- URL: http://arxiv.org/abs/2109.11377v1
- Date: Thu, 23 Sep 2021 13:47:16 GMT
- Title: WRENCH: A Comprehensive Benchmark for Weak Supervision
- Authors: Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang,
Alexander Ratner
- Abstract summary: The benchmark consists of 22 varied real-world datasets for classification and sequence tagging.
We use the benchmark to conduct extensive comparisons over more than 100 method variants, demonstrating its efficacy as a benchmark platform.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Weak Supervision (WS) approaches have had widespread success in
easing the bottleneck of labeling training data for machine learning by
synthesizing labels from multiple potentially noisy supervision sources.
However, proper measurement and analysis of these approaches remain a
challenge. First, datasets used in existing works are often private and/or
custom, limiting standardization. Second, WS datasets with the same name and
base data often vary in terms of the labels and weak supervision sources used,
a significant "hidden" source of evaluation variance. Finally, WS studies often
diverge in terms of the evaluation protocol and ablations used. To address
these problems, we introduce a benchmark platform, WRENCH, for a thorough
and standardized evaluation of WS approaches. It consists of 22 varied
real-world datasets for classification and sequence tagging; a range of real,
synthetic, and procedurally-generated weak supervision sources; and a modular,
extensible framework for WS evaluation, including implementations for popular
WS methods. We use WRENCH to conduct extensive comparisons over more than
100 method variants to demonstrate its efficacy as a benchmark platform. The
code is available at https://github.com/JieyuZ2/wrench.
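As a conceptual illustration of the pipeline the abstract describes (multiple noisy labeling sources aggregated into training labels), here is a minimal, self-contained sketch in plain numpy. It is not WRENCH's API; the majority-vote aggregator and the toy vote matrix are illustrative assumptions, and WRENCH itself also ships learned label models (e.g., Snorkel-style generative models).

```python
# Minimal sketch of the weak-supervision pipeline: several noisy labeling
# functions (LFs) vote on each example, and a label model aggregates the
# votes into training labels for a downstream end model. This illustrates
# the concept only; it is NOT WRENCH's API.
import numpy as np

ABSTAIN = -1  # convention: an LF that does not fire outputs -1

def majority_vote(lf_votes: np.ndarray, n_classes: int) -> np.ndarray:
    """Aggregate an (n_examples, n_lfs) vote matrix into hard labels.

    Ties and all-abstain rows fall back to the lowest class id;
    real label models handle these cases more carefully.
    """
    labels = np.zeros(lf_votes.shape[0], dtype=int)
    for i in range(lf_votes.shape[0]):
        votes = lf_votes[i]
        votes = votes[votes != ABSTAIN]  # drop abstentions
        if len(votes) == 0:
            labels[i] = 0  # no signal: fall back to a default class
        else:
            counts = np.bincount(votes, minlength=n_classes)
            labels[i] = int(np.argmax(counts))
    return labels

# Toy example: 3 noisy LFs voting on 4 examples, 2 classes.
lf_votes = np.array([
    [1, 1, ABSTAIN],              # two LFs agree on class 1
    [0, ABSTAIN, 0],              # two LFs agree on class 0
    [1, 0, 1],                    # majority says class 1
    [ABSTAIN, ABSTAIN, ABSTAIN],  # no LF fires
])
print(majority_vote(lf_votes, n_classes=2))  # -> [1 0 1 0]
```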
Related papers
- TSGBench: Time Series Generation Benchmark
TSGBench provides a unified and comprehensive assessment of synthetic Time Series Generation (TSG) methods.
It comprises three modules: (1) a curated collection of publicly available, real-world datasets tailored for TSG, together with a standardized preprocessing pipeline; (2) a comprehensive suite of evaluation measures, including vanilla measures, new distance-based assessments, and visualization tools; and (3) a pioneering generalization test rooted in Domain Adaptation (DA).
We have conducted experiments using TSGBench across ten real-world datasets from diverse domains, utilizing ten advanced TSG methods and twelve evaluation measures.
arXiv Detail & Related papers (2023-09-07T14:51:42Z)
- DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection
A critical yet frequently overlooked challenge in the field of deepfake detection is the lack of a standardized, unified, comprehensive benchmark.
We present the first comprehensive benchmark for deepfake detection, called DeepfakeBench, which offers three key contributions.
DeepfakeBench contains 15 state-of-the-art detection methods, 9 deepfake datasets, a series of deepfake detection evaluation protocols and analysis tools, as well as comprehensive evaluations.
arXiv Detail & Related papers (2023-07-04T01:34:41Z)
- Lifting Weak Supervision To Structured Prediction
Weak supervision (WS) is a rich set of techniques that produce pseudolabels by aggregating easily obtained but potentially noisy label estimates.
We introduce techniques new to weak supervision based on pseudo-Euclidean embeddings and tensor decompositions.
Several of our results, which can be viewed as robustness guarantees in structured prediction with noisy labels, may be of independent interest.
arXiv Detail & Related papers (2022-11-24T02:02:58Z)
- Binary Classification with Positive Labeling Sources
We propose WEAPO, a simple yet competitive WS method for producing training labels without negative labeling sources.
We show that WEAPO achieves the highest average performance on 10 benchmark datasets (a naive illustration of the positive-only setting is sketched after this entry).
arXiv Detail & Related papers (2022-08-02T19:32:08Z)
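The summary above does not spell out WEAPO's mechanism, so the sketch below is only a naive baseline for the positive-only setting it targets, not the WEAPO method: score each instance by the fraction of positive-only sources that fire on it and threshold. The function name, threshold value, and toy data are all hypothetical.

```python
# Naive baseline for the positive-only setting: each source either fires
# (votes positive, 1) or abstains (0). Score an instance by the fraction
# of sources that fire; instances above a threshold become pseudo-positive,
# the rest pseudo-negative. This illustrates the setting only; it is not
# the WEAPO method.
import numpy as np

def positive_only_pseudolabels(fires: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """fires: (n_examples, n_sources) binary matrix, 1 = source voted positive."""
    scores = fires.mean(axis=1)              # fraction of sources that fired
    return (scores >= threshold).astype(int)  # abstention acts as weak negative evidence

fires = np.array([
    [1, 1, 0],  # most sources fire -> pseudo-positive
    [0, 0, 0],  # no source fires   -> pseudo-negative
    [1, 0, 0],  # weak signal       -> pseudo-negative at threshold 0.5
])
print(positive_only_pseudolabels(fires))  # -> [1 0 0]
```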
- Extending the WILDS Benchmark for Unsupervised Adaptation
We present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data.
These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities.
We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods.
arXiv Detail & Related papers (2021-12-09T18:32:38Z)
- Semi-Supervised Domain Generalization with Stochastic StyleMatch
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost.
In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting.
Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling (the pseudo-labeling step is sketched after this entry).
arXiv Detail & Related papers (2021-06-01T16:00:08Z)
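FixMatch's core pseudo-labeling rule, which StyleMatch builds on, is compact enough to sketch. The numpy snippet below shows the confidence-thresholded step as described in the FixMatch paper: a prediction on a weakly augmented unlabeled example becomes a training target only when the model's confidence clears a threshold. This is a sketch of the general recipe, not StyleMatch itself; `model_probs_weak` and the toy probabilities are assumptions.

```python
# Sketch of FixMatch's confidence-thresholded pseudo-labeling rule:
# keep a prediction on a weakly augmented input as a training target
# (for the strongly augmented view) only if the model's confidence
# exceeds a threshold (0.95 in the FixMatch paper). numpy-only and
# framework-agnostic; `model_probs_weak` stands in for softmax outputs
# on weakly augmented unlabeled examples.
import numpy as np

def fixmatch_pseudolabels(model_probs_weak: np.ndarray, tau: float = 0.95):
    """Return (pseudo_labels, mask): hard labels and which examples to keep."""
    confidence = model_probs_weak.max(axis=1)
    pseudo_labels = model_probs_weak.argmax(axis=1)
    mask = confidence >= tau  # only confident predictions contribute to the loss
    return pseudo_labels, mask

probs = np.array([
    [0.97, 0.03],  # confident -> used as target for the strong augmentation
    [0.60, 0.40],  # uncertain -> excluded from the unlabeled loss
])
labels, mask = fixmatch_pseudolabels(probs)
print(labels, mask)  # -> [0 0] [ True False]
```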
- Back to Square One: Bias Detection, Training and Commonsense Disentanglement in the Winograd Schema
The Winograd Schema (WS) has been proposed as a test for measuring the commonsense capabilities of models.
We show that the current evaluation method of WS is sub-optimal and propose a modification that makes use of twin sentences for evaluation (the paired scoring is sketched after this entry).
We conclude that much of the apparent progress on WS may not necessarily reflect progress in commonsense reasoning.
arXiv Detail & Related papers (2021-04-16T15:17:23Z)
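The twin-sentence modification can be made concrete: instead of scoring each Winograd sentence independently, credit a model only when it answers both members of a twin pair correctly, which discounts models that solve one twin via surface statistics. A minimal sketch follows; the pairing representation is an assumption.

```python
# Sketch of paired (twin) evaluation for Winograd-style schemas: standard
# accuracy scores each sentence independently, while paired accuracy
# credits a model only when both twins in a pair are answered correctly.
from typing import List, Tuple

def standard_accuracy(results: List[Tuple[bool, bool]]) -> float:
    """Per-sentence accuracy over flattened twin pairs."""
    flat = [correct for pair in results for correct in pair]
    return sum(flat) / len(flat)

def paired_accuracy(results: List[Tuple[bool, bool]]) -> float:
    """results: one (twin_a_correct, twin_b_correct) tuple per twin pair."""
    return sum(a and b for a, b in results) / len(results)

# A model exploiting a surface bias may get one twin right by chance:
results = [(True, False), (True, True), (True, False), (True, True)]
print(standard_accuracy(results))  # 0.75 -- looks decent
print(paired_accuracy(results))    # 0.50 -- the stricter twin metric
```

The gap between the two scores is one way to quantify how much apparent progress comes from exploitable biases rather than commonsense reasoning.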