Related papers: Scaling Up Active Testing to Large Language Models

Scaling Up Active Testing to Large Language Models

URL: http://arxiv.org/abs/2508.09093v1
Date: Tue, 12 Aug 2025 17:17:51 GMT
Title: Scaling Up Active Testing to Large Language Models
Authors: Gabrielle Berrada, Jannik Kossen, Muhammed Razzak, Freddie Bickford Smith, Yarin Gal, Tom Rainforth,
Abstract summary: We show how it can be successfully scaled up to the evaluation of large language models.<n>In particular we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning.
Score: 45.13194096236772
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Active testing enables label-efficient evaluation of models through careful data acquisition. However, its significant computational costs have previously undermined its use for large models. We show how it can be successfully scaled up to the evaluation of large language models (LLMs). In particular we show that the surrogate model used to guide data acquisition can be constructed cheaply using in-context learning, does not require updating within an active-testing loop, and can be smaller than the target model. We even find we can make good data-acquisition decisions without computing predictions with the target model and further introduce a single-run error estimator to asses how well active testing is working on the fly. We find that our approach is able to more effectively evaluate LLM performance with less data than current standard practices.

Related papers

Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach [65.6966065843227]
Iterative Reweight-then-IRO is a framework that performs RL-style alignment of a frozen base model without touching its parameters.<n>At test time, the value functions are used to guide the base model generation via a search-based optimization process.<n> Notably, users can apply IRO to align a model on their own dataset, similar to OpenAI's reinforcement fine-tuning (RFT)
arXiv Detail & Related papers (2025-06-21T21:49:02Z)
Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.<n>One core challenge of evaluation in the large language model (LLM) era is the generalization issue.<n>We propose Model Utilization Index (MUI), a mechanism interpretability enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
Language Models can Self-Improve at State-Value Estimation for Better Search [16.933525465335524]
We present self-taught look (STL), a self-ahead method that leverages state-transition dynamics to improve a value model.<n>We find that specialized value models learned with STL can be deployed with computationally lightweight search algorithms, achieving performance that matches that of more expensive tree search methods.
arXiv Detail & Related papers (2025-03-04T18:58:11Z)
Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures. We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
Iterative Loop Learning Combining Self-Training and Active Learning for Domain Adaptive Semantic Segmentation [1.827510863075184]
Self-training and active learning have been proposed to alleviate this problem. This paper proposes an iterative loop learning method combining Self-Training and Active Learning.
arXiv Detail & Related papers (2023-01-31T01:31:43Z)
Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation [59.7305309038676]
We propose Active Surrogate Estimators (ASEs) for model evaluation. We find that ASEs offer greater label-efficiency than the current state-of-the-art.
arXiv Detail & Related papers (2022-02-14T17:15:18Z)
Practical Active Learning with Model Selection for Small Data [13.128648437690224]
We develop a simple and fast method for practical active learning with model selection. Our method is based on an underlying pool-based active learner for binary classification using support vector classification with a radial basis function kernel.
arXiv Detail & Related papers (2021-12-21T23:11:27Z)
ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data. The idea is to estimate the metrics of interest for a model-under-test using Bayesian neural network (BNN)
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
Active Testing: Sample-Efficient Model Evaluation [39.200332879659456]
We introduce active testing: a new framework for sample-efficient model evaluation. Active testing addresses this by carefully selecting the test points to label. We show how to remove that bias while reducing the variance of the estimator.
arXiv Detail & Related papers (2021-03-09T10:20:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.