Related papers: AutoEval Done Right: Using Synthetic Data for Model Evaluation

URL: http://arxiv.org/abs/2403.07008v2
Date: Tue, 28 May 2024 04:38:41 GMT
Title: AutoEval Done Right: Using Synthetic Data for Model Evaluation
Authors: Pierre Boyeau, Anastasios N. Angelopoulos, Nir Yosef, Jitendra Malik, Michael I. Jordan
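Abstract summary: We suggest efficient and statistically principled algorithms for autoevaluation, the use of AI-labeled synthetic data to reduce the number of human annotations needed for model evaluation. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.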
Score: 79.01454261157525
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
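As a rough illustration of how AI labels can reduce the need for human labels without introducing bias, here is a minimal sketch in the spirit of prediction-powered inference, which appears among the related papers listed here. It is not the authors' implementation: the function name `ppi_mean`, the weight `lam`, and the toy data are all hypothetical. The estimate starts from the mean of the human labels and adds a correction equal to the difference between the AI labels' mean on the large unlabeled pool and their mean on the human-labeled subset. Assuming both sets are drawn from the same distribution, that correction has expectation zero, so the estimate stays unbiased for any fixed `lam`.

```python
from statistics import fmean


def ppi_mean(human_labels, ai_on_labeled, ai_on_unlabeled, lam=1.0):
    """Prediction-powered style estimate of the mean human label.

    human_labels    -- human scores on the small labeled set
    ai_on_labeled   -- AI scores on that same labeled set (same order)
    ai_on_unlabeled -- AI scores on a large set with no human labels
    lam             -- hypothetical weight on the AI-based correction;
                       lam=0 ignores the AI labels entirely
    """
    if len(human_labels) != len(ai_on_labeled):
        raise ValueError("human and AI labels must cover the same items")
    # The correction has zero mean when both sets come from the same
    # distribution, so adding it does not bias the estimate.
    correction = fmean(ai_on_unlabeled) - fmean(ai_on_labeled)
    return fmean(human_labels) + lam * correction


# Toy data (made up for illustration): 1 = model answer judged correct.
human = [1, 0, 1, 1]
ai_labeled = [1, 1, 1, 0]
ai_unlabeled = [1, 0, 1, 0, 1, 1, 0, 1]

print(ppi_mean(human, ai_labeled, ai_unlabeled, lam=0.0))  # human labels only
print(ppi_mean(human, ai_labeled, ai_unlabeled, lam=1.0))  # with AI correction
```

The variance reduction, and hence the gain in effective sample size, depends on how well the AI labels correlate with the human ones; choosing `lam` from the data is one way such methods adapt to that correlation.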

Related papers

Adaptive Prediction-Powered AutoEval with Reliability and Efficiency Guarantees [36.407171992845456]
We propose R-AutoEval+, a novel framework that provides finite-sample reliability guarantees on the model evaluation. The key innovation of R-AutoEval+ is an adaptive construction of the model evaluation variable, which dynamically tunes its reliance on synthetic data.
arXiv Detail & Related papers (2025-05-24T11:53:29Z)
How to Select Datapoints for Efficient Human Evaluation of NLG Models? [57.60407340254572]
We develop and analyze a suite of selectors to get the most informative datapoints for human evaluation. We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection. In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
arXiv Detail & Related papers (2025-01-30T10:33:26Z)
Auto-Evaluation with Few Labels through Post-hoc Regression [4.813376208491175]
The Prediction Powered Inference (PPI) framework provides a way of leveraging the statistical power of automatic evaluation together with a small pool of labelled data. We present two new PPI-based techniques that leverage robust regressors to produce even lower variance estimators in the few-label regime.
arXiv Detail & Related papers (2024-11-19T17:17:46Z)
Little Giants: Synthesizing High-Quality Embedding Data at Scale [71.352883755806]
We introduce SPEED, a framework that aligns open-source small models to efficiently generate large-scale embedding data. SPEED uses less than 1/10 of the GPT API calls, outperforming the state-of-the-art embedding model E5_mistral when both are trained solely on their synthetic data.
arXiv Detail & Related papers (2024-10-24T10:47:30Z)
Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model evaluator preferences with human evaluations. We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer. The recalibration significantly improves the alignment of the LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z)
Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs). Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
Synthetic Information towards Maximum Posterior Ratio for deep learning on Imbalanced Data [1.7495515703051119]
We propose a technique for data balancing by generating synthetic data for the minority class. Our method prioritizes balancing the informative regions by identifying high entropy samples. Our experimental results on forty-one datasets demonstrate the superior performance of our technique.
arXiv Detail & Related papers (2024-01-05T01:08:26Z)
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. We investigate whether we can go beyond human data on tasks where we have access to scalar feedback. We find that ReST$^{EM}$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z)
Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF). It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model. We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z)
Post-training Model Quantization Using GANs for Synthetic Data Generation [57.40733249681334]
We investigate the use of synthetic data as a substitute for the calibration with real data for the quantization method. We compare the performance of models quantized using data generated by StyleGAN2-ADA and our pre-trained DiStyleGAN, with quantization using real data and an alternative data generation method based on fractal images.
arXiv Detail & Related papers (2023-05-10T11:10:09Z)
Adaptive t-Momentum-based Optimization for Unknown Ratio of Outliers in Amateur Data in Imitation Learning [3.145455301228175]
Behavioral cloning (BC) bears a high potential for safe and direct transfer of human skills to robots. In order to allow the imitators to effectively learn from imperfect demonstrations, we propose to employ the robust t-momentum optimization algorithm. We show empirically how the algorithm can be used to produce robust BC imitators against datasets with unknown heaviness.
arXiv Detail & Related papers (2021-08-02T04:30:41Z)
Human or Machine: Automating Human Likeliness Evaluation of NLG Texts [0.0]
We propose to use a human likeliness score that shows the percentage of the output samples from a method that look as if they were written by a human. As a follow-up, we plan to perform an empirical analysis of human-written and machine-generated texts to find the optimal setup of this evaluation approach.
arXiv Detail & Related papers (2020-06-05T00:57:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.