Adaptively evaluating models with task elicitation
- URL: http://arxiv.org/abs/2503.01986v1
- Date: Mon, 03 Mar 2025 19:04:10 GMT
- Title: Adaptively evaluating models with task elicitation
- Authors: Davis Brown, Prithvi Balehannina, Helen Jin, Shreya Havaldar, Hamed Hassani, Eric Wong,
- Abstract summary: We introduce and validate a framework for evaluating language models called Adaptive Evaluations.<n>We find that frontier models lack consistency when adaptively probed with our framework on a diverse suite of datasets and tasks.<n> Generated questions pass human validity checks and often transfer to other models with different capability profiles.
- Score: 29.704450391533864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Manual curation of evaluation datasets is struggling to keep up with the rapidly expanding capabilities and deployment scenarios of language models. Towards scalable model profiling, we introduce and validate a framework for evaluating LLMs, called Adaptive Evaluations. Adaptive evaluations use scaffolded language models (evaluator agents) to search through a target model's behavior on a domain dataset and create difficult questions (tasks) that can discover and probe the model's failure modes. We find that frontier models lack consistency when adaptively probed with our framework on a diverse suite of datasets and tasks, including but not limited to legal reasoning, forecasting, and online harassment. Generated questions pass human validity checks and often transfer to other models with different capability profiles, demonstrating that adaptive evaluations can also be used to create difficult domain-specific datasets.
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z) - SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities [55.87169702896249]
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift.<n>We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment.<n>Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications.
arXiv Detail & Related papers (2024-07-16T12:52:29Z) - OLMES: A Standard for Language Model Evaluations [64.85905119836818]
OLMES is a documented, practical, open standard for reproducible language model evaluations.<n>It supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions.<n> OLMES includes well-considered, documented recommendations guided by results from existing literature as well as new experiments resolving open questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z) - AttributionScanner: A Visual Analytics System for Model Validation with Metadata-Free Slice Finding [29.07617945233152]
Data slice finding is an emerging technique for validating machine learning (ML) models by identifying and analyzing subgroups in a dataset that exhibit poor performance.<n>This approach faces significant challenges, including the laborious and costly requirement for additional metadata.<n>We introduce AttributionScanner, an innovative human-in-the-loop Visual Analytics (VA) system, designed for metadata-free data slice finding.<n>Our system identifies interpretable data slices that involve common model behaviors and visualizes these patterns through an Attribution Mosaic design.
arXiv Detail & Related papers (2024-01-12T09:17:32Z) - Adapting Large Language Models for Content Moderation: Pitfalls in Data
Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we introduce how to fine-tune a LLM model that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - Earning Extra Performance from Restrictive Feedbacks [41.05874087063763]
We set up a challenge named emphEarning eXtra PerformancE from restriCTive feEDdbacks (EXPECTED) to describe this form of model tuning problems.
The goal of the model provider is to eventually deliver a satisfactory model to the local user(s) by utilizing the feedbacks.
We propose to characterize the geometry of the model performance with regard to model parameters through exploring the parameters' distribution.
arXiv Detail & Related papers (2023-04-28T13:16:54Z) - Variable Importance Matching for Causal Inference [73.25504313552516]
We describe a general framework called Model-to-Match that achieves these goals.
Model-to-Match uses variable importance measurements to construct a distance metric.
We operationalize the Model-to-Match framework with LASSO.
arXiv Detail & Related papers (2023-02-23T00:43:03Z) - Discover, Explanation, Improvement: An Automatic Slice Detection
Framework for Natural Language Processing [72.14557106085284]
slice detection models (SDM) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
arXiv Detail & Related papers (2022-11-08T19:00:00Z) - Assessing Out-of-Domain Language Model Performance from Few Examples [38.245449474937914]
We address the task of predicting out-of-domain (OOD) performance in a few-shot fashion.
We benchmark the performance on this task when looking at model accuracy on the few-shot examples.
We show that attribution-based factors can help rank relative model OOD performance.
arXiv Detail & Related papers (2022-10-13T04:45:26Z) - Calibrating Over-Parametrized Simulation Models: A Framework via
Eligibility Set [3.862247454265944]
We develop a framework to develop calibration schemes that satisfy rigorous frequentist statistical guarantees.
We demonstrate our methodology on several numerical examples, including an application to calibration of a limit order book market simulator.
arXiv Detail & Related papers (2021-05-27T00:59:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.