Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks
- URL: http://arxiv.org/abs/2501.07727v2
- Date: Thu, 30 Jan 2025 06:21:12 GMT
- Title: Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks
- Authors: Tianyi Zhang, Linrong Cai, Jeffrey Li, Nicholas Roberts, Neel Guha, Jinoh Lee, Frederic Sala
- Abstract summary: Weak supervision (WS) is a popular approach for label-efficient learning, leveraging diverse sources of noisy but inexpensive weak labels to automatically annotate training data. Despite its wide usage, WS and its practical value are challenging to benchmark due to the many knobs in its setup. We introduce a new benchmark, BOXWRENCH, designed to more accurately reflect real-world usages of WS.
- Score: 19.49705185032905
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weak supervision (WS) is a popular approach for label-efficient learning, leveraging diverse sources of noisy but inexpensive weak labels to automatically annotate training data. Despite its wide usage, WS and its practical value are challenging to benchmark due to the many knobs in its setup, including: data sources, labeling functions (LFs), aggregation techniques (called label models), and end model pipelines. Existing evaluation suites tend to be limited, focusing on particular components or specialized use cases. Moreover, they often involve simplistic benchmark tasks or de-facto LF sets that are suboptimally written, producing insights that may not generalize to real-world settings. We address these limitations by introducing a new benchmark, BOXWRENCH, designed to more accurately reflect real-world usages of WS. This benchmark features tasks with (1) higher class cardinality and imbalance, (2) notable domain expertise requirements, and (3) opportunities to re-use LFs across parallel multilingual corpora. For all tasks, LFs are written using a careful procedure aimed at mimicking real-world settings. In contrast to existing WS benchmarks, we show that supervised learning requires substantial amounts (1000+) of labeled examples to match WS in many settings.
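To make the pipeline components concrete, here is a minimal, illustrative sketch of a weak-supervision setup: a few hypothetical keyword labeling functions (LFs) vote on unlabeled text, a simple majority-vote label model aggregates the noisy votes, and an end model is trained on the aggregated labels. The LFs, toy data, and majority-vote aggregation are assumptions for illustration only, not the BOXWRENCH tasks, LF-writing procedure, or label models evaluated in the paper.

```python
# Minimal weak-supervision sketch (illustrative; not the BOXWRENCH pipeline).
# Labeling functions (LFs) vote on each unlabeled example or abstain; a
# majority-vote "label model" aggregates the noisy votes into training
# labels; an end model is then trained on the weakly labeled data.
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

# Hypothetical keyword LFs for a toy sentiment task (placeholders only).
def lf_great(text):
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_terrible(text):
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN

def lf_refund(text):
    return NEGATIVE if "refund" in text.lower() else ABSTAIN

LFS = [lf_great, lf_terrible, lf_refund]

def majority_vote(text):
    """Aggregate LF votes; abstain if no LF fires or the vote is tied."""
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between classes
    return ranked[0][0]

unlabeled = [
    "This product is great and arrived quickly",
    "Terrible experience, I want a refund",
    "Great value for the price",
    "The packaging was damaged, please issue a refund",
]

# Keep only examples the label model could label, then train the end model.
weakly_labeled = [(x, majority_vote(x)) for x in unlabeled]
train = [(x, y) for x, y in weakly_labeled if y != ABSTAIN]
texts, labels = zip(*train)

vectorizer = CountVectorizer()
end_model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)
print(end_model.predict(vectorizer.transform(["great customer service"])))
```

In practice, the label models and end-model pipelines benchmarked in the paper are more sophisticated than a plain majority vote; the sketch only shows where the LFs, the label model, and the end model sit in a WS setup.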
Related papers
- Pushing the boundary on Natural Language Inference [49.15148871877941]
Natural Language Inference (NLI) is a central task in natural language understanding with applications in fact-checking, question answering and information retrieval.
Despite its importance, current NLI systems rely heavily on dataset artifacts and biases, limiting inference quality and real-world applicability.
This work provides a framework for building robust NLI systems without sacrificing quality or real-world applicability.
arXiv Detail & Related papers (2025-04-25T14:20:57Z) - Enhancing Unsupervised Graph Few-shot Learning via Set Functions and Optimal Transport [23.36436403062214]
Recent advancements in graph few-shot learning models have exhibited superior performance across diverse applications.
We propose a novel model named STAR, which enhances unsupervised graph few-shot learning.
arXiv Detail & Related papers (2025-01-10T00:42:27Z) - AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? [55.14033256706175]
Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information.
We introduce AutoBench-V, an automated framework for serving evaluation on demand.
Through an extensive evaluation of seven popular LVLMs across five user-specified evaluation capabilities, the framework shows effectiveness and reliability.
arXiv Detail & Related papers (2024-10-28T17:55:08Z) - MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
The Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models [26.590755599827993]
We present OpsEval, a comprehensive task-oriented Ops benchmark designed for large language models (LLMs).
The benchmark includes 7184 multi-choice questions and 1736 question-answering (QA) formats in English and Chinese.
To ensure the credibility of our evaluation, we invite dozens of domain experts to manually review our questions.
arXiv Detail & Related papers (2023-10-11T16:33:29Z) - Universal Self-Adaptive Prompting [60.67460565566514]
Universal Self-Adaptive Prompting (USP) is an automatic prompt design approach specifically tailored for zero-shot learning.
USP is highly versatile: to achieve universal prompting, USP categorizes a given NLP task into one of three possible task types.
We evaluate USP with PaLM and PaLM 2 models and demonstrate performances that are considerably stronger than standard zero-shot baselines.
arXiv Detail & Related papers (2023-05-24T09:09:48Z) - Towards Practical Few-Shot Query Sets: Transductive Minimum Description Length Inference [0.0]
We introduce a PrimAl Dual Minimum Description LEngth (PADDLE) formulation, which balances data-fitting accuracy and model complexity for a given few-shot task.
Our constrained MDL-like objective promotes competition among a large set of possible classes, preserving only the effective classes that best fit the data of a given few-shot task.
arXiv Detail & Related papers (2022-10-26T08:06:57Z) - AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels [23.849748213613452]
We introduce AutoWS-Bench-101: a framework for evaluating automated WS techniques in challenging WS settings.
We ask whether a practitioner should use an AutoWS method to generate additional labels or use some simpler baselines.
We conclude with a thorough ablation study of AutoWS methods.
arXiv Detail & Related papers (2022-08-30T16:09:42Z) - Low Resource Pipeline for Spoken Language Understanding via Weak Supervision [5.9901156966011975]
In Weakly Supervised Learning (WSL), a model is trained over noisy labels obtained from semantic rules and task-specific pre-trained models.
We show that task-agnostic prompts are generalizable and can be used to obtain noisy labels for different Spoken Language Understanding (SLU) tasks.
We demonstrate that prompt-based methods generate reliable labels for the above SLU tasks and can thus be used as a universal weak source to train a weakly supervised model (WSM) in the absence of labeled data.
arXiv Detail & Related papers (2022-06-21T17:36:31Z) - Few-Shot Class-Incremental Learning by Sampling Multi-Phase Tasks [59.12108527904171]
A model should recognize new classes and maintain discriminability over old classes.
The task of recognizing few-shot new classes without forgetting old classes is called few-shot class-incremental learning (FSCIL).
We propose a new paradigm for FSCIL based on meta-learning by LearnIng Multi-phase Incremental Tasks (LIMIT).
arXiv Detail & Related papers (2022-03-31T13:46:41Z) - WRENCH: A Comprehensive Benchmark for Weak Supervision [66.82046201714766]
The benchmark consists of 22 varied real-world datasets for classification and sequence tagging.
We use the benchmark to conduct extensive comparisons over more than 100 method variants to demonstrate its efficacy as a benchmark platform.
arXiv Detail & Related papers (2021-09-23T13:47:16Z) - KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose a knowledge-grounded pre-training (KGPT) approach to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)