Real-Time Visual Feedback to Guide Benchmark Creation: A
Human-and-Metric-in-the-Loop Workflow
- URL: http://arxiv.org/abs/2302.04434v1
- Date: Thu, 9 Feb 2023 04:43:10 GMT
- Title: Real-Time Visual Feedback to Guide Benchmark Creation: A
Human-and-Metric-in-the-Loop Workflow
- Authors: Anjana Arunkumar, Swaroop Mishra, Bhavdeep Sachdeva, Chitta Baral,
Chris Bryan
- Abstract summary: We propose VAIDA, a novel benchmark creation paradigm for NLP.
VAIDA focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies.
We find that VAIDA decreases the effort, frustration, and mental and temporal demands of crowdworkers and analysts.
- Score: 22.540665278228975
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent research has shown that language models exploit 'artifacts' in
benchmarks to solve tasks, rather than truly learning them, leading to inflated
model performance. In pursuit of creating better benchmarks, we propose VAIDA,
a novel benchmark creation paradigm for NLP, that focuses on guiding
crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies.
VAIDA facilitates sample correction by providing real-time visual feedback and
recommendations to improve sample quality. Our approach is domain, model, task,
and metric agnostic, and constitutes a paradigm shift for robust, validated,
and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We
evaluate VAIDA via expert review and a user study with NASA TLX. We find that
VAIDA decreases the effort, frustration, and mental and temporal demands of
crowdworkers and analysts, while simultaneously increasing the performance of
both user groups, with a 45.8% decrease in the level of artifacts in created
samples. As a by-product of our user study, we observe that the created samples
are adversarial across models, leading to performance decreases of 31.3% (BERT),
22.5% (RoBERTa), and 14.98% (GPT-3 few-shot).
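The abstract describes sample correction driven by real-time, metric-based feedback to crowdworkers. As a rough illustration only, the Python sketch below shows one way such a metric-in-the-loop check could be wired up; the quality checks (`lexical_overlap`, `negation_artifact`), thresholds, and suggestions are hypothetical placeholders, not VAIDA's actual metrics or interface.

```python
# Hypothetical sketch of a metric-in-the-loop check on a crowdworker's sample.
# The metrics, thresholds, and suggestions are illustrative placeholders and are
# not VAIDA's actual components.

from dataclasses import dataclass


@dataclass
class Feedback:
    metric: str
    score: float
    passed: bool
    suggestion: str


def lexical_overlap(premise: str, hypothesis: str) -> float:
    """Fraction of hypothesis tokens copied verbatim from the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hypothesis_tokens) / len(hypothesis_tokens)


def negation_artifact(hypothesis: str) -> float:
    """Crude proxy for a well-known NLI artifact: bare negation words."""
    return 1.0 if any(w in hypothesis.lower().split() for w in ("not", "never", "no")) else 0.0


def review_sample(premise: str, hypothesis: str) -> list[Feedback]:
    """Score a candidate sample and return per-metric feedback immediately."""
    checks = [
        ("lexical_overlap", lexical_overlap(premise, hypothesis), 0.7,
         "Rephrase the hypothesis instead of copying words from the premise."),
        ("negation_artifact", negation_artifact(hypothesis), 0.5,
         "Avoid relying on bare negation words to flip the label."),
    ]
    return [Feedback(name, score, score < threshold, hint)
            for name, score, threshold, hint in checks]


if __name__ == "__main__":
    for fb in review_sample("A man is playing a guitar on stage.",
                            "A man is not playing an instrument."):
        status = "looks good" if fb.passed else "please revise"
        print(f"{fb.metric}: {fb.score:.2f} ({status}) - {fb.suggestion}")
```

In VAIDA itself, the analogous scores drive visual feedback and revision recommendations rather than console output, and the approach is metric agnostic, so checks like the ones above could be swapped for any scorer.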
Related papers
- LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and automatically generates visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
- It's all about PR -- Smart Benchmarking AI Accelerators using Performance Representatives [40.197673152937256]
Training statistical performance models often requires vast amounts of data, which demands a significant time investment and can be difficult when hardware availability is limited.
We propose a novel performance modeling methodology that significantly reduces the number of training samples while maintaining good accuracy.
We achieve a Mean Absolute Percentage Error (MAPE) as low as 0.02% for single-layer estimations and 0.68% for whole-model estimations with fewer than 10,000 training samples.
arXiv Detail & Related papers (2024-06-12T15:34:28Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve model alignment across different task scenarios.
We implement UAL in a simple fashion: the label smoothing value used during training is set adaptively according to the uncertainty of each individual sample (a minimal sketch of this idea appears after this list).
Experiments on widely used benchmarks demonstrate that UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- Re-ReST: Reflection-Reinforced Self-Training for Language Agents [101.22559705696885]
Self-training lets language agents generate supervision from their own outputs.
We present Reflection-Reinforced Self-Training (Re-ReST), which uses a reflector to refine low-quality generated samples (a sketch of this loop also appears after this list).
arXiv Detail & Related papers (2024-06-03T16:21:38Z)
- Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate, for example, that leveraging its insights improves the performance of the Llama 2 model by up to 15% points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Feedback-guided Data Synthesis for Imbalanced Classification [10.836265321046561]
We introduce a framework for augmenting static datasets with useful synthetic samples.
We find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse.
On ImageNet-LT, we achieve state-of-the-art results, with an improvement of over 4% on underrepresented classes.
arXiv Detail & Related papers (2023-09-29T21:47:57Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several different datasets, using a variety of state-of-the-art benchmarks, demonstrates that our approach achieves significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)
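The Uncertainty Aware Learning entry above describes adaptively setting the label smoothing value according to the uncertainty of each training sample. The NumPy sketch below is a minimal illustration of that idea under assumed details: how uncertainty is estimated, the linear mapping from uncertainty to a smoothing value, and the `max_smooth` constant are placeholders, not details taken from the paper.

```python
# Minimal sketch of uncertainty-aware label smoothing.
# The uncertainty source and the uncertainty-to-smoothing mapping are assumptions
# made for illustration; they are not details taken from the UAL paper.

import numpy as np


def smoothed_targets(labels: np.ndarray, uncertainty: np.ndarray,
                     num_classes: int, max_smooth: float = 0.3) -> np.ndarray:
    """Build soft targets whose smoothing grows with per-sample uncertainty."""
    eps = np.clip(uncertainty, 0.0, 1.0) * max_smooth   # per-sample smoothing value
    one_hot = np.zeros((len(labels), num_classes))
    one_hot[np.arange(len(labels)), labels] = 1.0
    # Blend the one-hot target with a uniform distribution, weighted by eps.
    return one_hot * (1.0 - eps[:, None]) + eps[:, None] / num_classes


def cross_entropy(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean cross-entropy between softmax(logits) and the (soft) targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = np.array([2, 0])                 # gold classes for two samples
    uncertainty = np.array([0.9, 0.1])        # e.g. from ensemble disagreement
    logits = rng.normal(size=(2, 4))
    loss = cross_entropy(logits, smoothed_targets(labels, uncertainty, num_classes=4))
    print(f"uncertainty-aware smoothed cross-entropy: {loss:.3f}")
```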
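Likewise, the Re-ReST entry describes routing low-quality self-generated samples through a reflector before using them for self-training. The loop below sketches that structure; `generate`, `reflect`, and `passes_check` are hypothetical stand-ins for the agent, the reflector model, and whatever correctness signal is available.

```python
# Sketch of a reflection-reinforced self-training data collection loop.
# `generate`, `reflect`, and `passes_check` are hypothetical stand-ins, not the
# actual Re-ReST components.

from typing import Callable


def collect_self_training_data(
    tasks: list[str],
    generate: Callable[[str], str],              # agent: task -> candidate output
    reflect: Callable[[str, str], str],          # reflector: (task, bad output) -> revision
    passes_check: Callable[[str, str], bool],    # quality / correctness signal
) -> list[tuple[str, str]]:
    """Keep good generations; route failures through the reflector once."""
    dataset: list[tuple[str, str]] = []
    for task in tasks:
        attempt = generate(task)
        if not passes_check(task, attempt):
            attempt = reflect(task, attempt)     # refine the low-quality sample
        if passes_check(task, attempt):
            dataset.append((task, attempt))      # only verified pairs are kept
    return dataset
```

The collected pairs would then serve as self-training supervision; the single reflection round here is a simplification.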
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information listed and is not responsible for any consequences arising from its use.