Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks
- URL: http://arxiv.org/abs/2501.07727v2
- Date: Thu, 30 Jan 2025 06:21:12 GMT
- Title: Stronger Than You Think: Benchmarking Weak Supervision on Realistic Tasks
- Authors: Tianyi Zhang, Linrong Cai, Jeffrey Li, Nicholas Roberts, Neel Guha, Jinoh Lee, Frederic Sala
- Abstract summary: Weak supervision (WS) is a popular approach for label-efficient learning, leveraging diverse sources of noisy but inexpensive weak labels to automatically annotate training data.
Despite its wide usage, WS and its practical value are challenging to benchmark due to the many knobs in its setup.
We introduce a new benchmark, BOXWRENCH, designed to more accurately reflect real-world usages of WS.
- Score: 19.49705185032905
- Abstract: Weak supervision (WS) is a popular approach for label-efficient learning, leveraging diverse sources of noisy but inexpensive weak labels to automatically annotate training data. Despite its wide usage, WS and its practical value are challenging to benchmark due to the many knobs in its setup, including: data sources, labeling functions (LFs), aggregation techniques (called label models), and end model pipelines. Existing evaluation suites tend to be limited, focusing on particular components or specialized use cases. Moreover, they often involve simplistic benchmark tasks or de-facto LF sets that are suboptimally written, producing insights that may not generalize to real-world settings. We address these limitations by introducing a new benchmark, BOXWRENCH, designed to more accurately reflect real-world usages of WS. This benchmark features tasks with (1) higher class cardinality and imbalance, (2) notable domain expertise requirements, and (3) opportunities to re-use LFs across parallel multilingual corpora. For all tasks, LFs are written using a careful procedure aimed at mimicking real-world settings. In contrast to existing WS benchmarks, we show that supervised learning requires substantial amounts (1000+) of labeled examples to match WS in many settings.
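To make the setup concrete, the pipeline the abstract describes — several noisy labeling functions (LFs) vote on each example, and a label model aggregates the votes into training labels — can be sketched as follows. This is a minimal illustration with hypothetical keyword LFs and a simple majority-vote label model, not the benchmark's actual LFs or aggregation method:

```python
from collections import Counter

ABSTAIN = -1  # LFs may decline to vote on an example

# Hypothetical labeling functions for a toy two-class task
# (1 = billing-related, 0 = other); real LFs encode domain expertise.
def lf_keyword_refund(text):
    return 1 if "refund" in text.lower() else ABSTAIN

def lf_keyword_invoice(text):
    return 1 if "invoice" in text.lower() else ABSTAIN

def lf_greeting(text):
    return 0 if text.lower().startswith("hi") else ABSTAIN

LFS = [lf_keyword_refund, lf_keyword_invoice, lf_greeting]

def majority_vote(text, lfs=LFS):
    """Aggregate LF votes; return the plurality label, or ABSTAIN if no LF fires."""
    votes = [v for lf in lfs if (v := lf(text)) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

docs = ["Hi, quick question", "Please refund my invoice", "random note"]
labels = [majority_vote(d) for d in docs]  # [0, 1, -1]
```

The weakly labeled (non-abstaining) examples would then be used to train an end model; practical label models (e.g. Snorkel-style generative models) additionally weight LFs by their estimated accuracies and correlations rather than counting votes equally.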
Related papers
- Enhancing Unsupervised Graph Few-shot Learning via Set Functions and Optimal Transport [23.36436403062214]
Recent advancements in graph few-shot learning models have exhibited superior performance across diverse applications.
We propose a novel model named STAR, which enhances unsupervised graph few-shot learning.
arXiv Detail & Related papers (2025-01-10T00:42:27Z)
- MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups.
It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics.
With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension [62.40482764691584]
We introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs.
Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs.
We conduct a thorough evaluation involving 34 prominent MLLMs and emphasize the current limitations of MLLMs in text-rich visual comprehension.
arXiv Detail & Related papers (2024-04-25T17:39:35Z)
- Universal Self-Adaptive Prompting [60.67460565566514]
Universal Self-Adaptive Prompting (USP) is an automatic prompt design approach specifically tailored for zero-shot learning.
USP is highly versatile: to achieve universal prompting, it categorizes a given NLP task into one of three possible task types.
We evaluate USP with PaLM and PaLM 2 models and demonstrate performances that are considerably stronger than standard zero-shot baselines.
arXiv Detail & Related papers (2023-05-24T09:09:48Z)
- Towards Practical Few-Shot Query Sets: Transductive Minimum Description Length Inference [0.0]
We introduce a PrimAl Dual Minimum Description LEngth (PADDLE) formulation, which balances data-fitting accuracy and model complexity for a given few-shot task.
Our constrained MDL-like objective promotes competition among a large set of possible classes, preserving only effective classes that befit better the data of a few-shot task.
arXiv Detail & Related papers (2022-10-26T08:06:57Z)
- AutoWS-Bench-101: Benchmarking Automated Weak Supervision with 100 Labels [23.849748213613452]
We introduce AutoWS-Bench-101: a framework for evaluating automated WS techniques in challenging WS settings.
We ask whether a practitioner should use an AutoWS method to generate additional labels or use some simpler baselines.
We conclude with a thorough ablation study of AutoWS methods.
arXiv Detail & Related papers (2022-08-30T16:09:42Z)
- Low Resource Pipeline for Spoken Language Understanding via Weak Supervision [5.9901156966011975]
In Weakly Supervised Learning (WSL), a model is trained on noisy labels obtained from semantic rules and task-specific pre-trained models.
We show that task-agnostic prompts are generalizable and can be used to obtain noisy labels for different Spoken Language Understanding (SLU) tasks.
We demonstrate that prompt-based methods generate reliable labels for the above SLU tasks and can thus serve as a universal weak source for training a weakly supervised model (WSM) in the absence of labeled data.
arXiv Detail & Related papers (2022-06-21T17:36:31Z)
- WRENCH: A Comprehensive Benchmark for Weak Supervision [66.82046201714766]
The benchmark consists of 22 varied real-world datasets for classification and sequence tagging.
We use the benchmark to conduct extensive comparisons over more than 100 method variants, demonstrating its efficacy as a benchmark platform.
arXiv Detail & Related papers (2021-09-23T13:47:16Z)
- KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose a knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.