Instance-level Performance Prediction for Long-form Generation Tasks
- URL: http://arxiv.org/abs/2509.07309v1
- Date: Tue, 09 Sep 2025 00:59:34 GMT
- Title: Instance-level Performance Prediction for Long-form Generation Tasks
- Authors: Chi-Yang Hsu, Alexander Braylan, Yiheng Su, Omar Alonso, Matthew Lease
- Abstract summary: We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks with fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs.
- Score: 47.21442052294225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
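The task formulation above is simple to prototype: embed the black-box input/output pair, fit a point regressor for the metric score, and fit quantile regressors for the interval. The encoder, models, and toy data below are illustrative assumptions, not the benchmark's actual baselines.

```python
# Minimal sketch of instance-level performance prediction: given only
# black-box (input, output) text, predict a continuous metric score plus a
# prediction interval. All modeling choices here are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-ins for the 16 labeled training instances.
train_inputs = [f"question {i}" for i in range(16)]
train_outputs = [f"long-form answer {i}" for i in range(16)]
train_scores = np.random.rand(16)  # fine-grained metric scores in [0, 1]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed feature extractor

def featurize(inputs, outputs):
    # Concatenate embeddings of the task input and the long-form output.
    return np.hstack([encoder.encode(inputs), encoder.encode(outputs)])

X = featurize(train_inputs, train_outputs)
point = GradientBoostingRegressor(loss="squared_error").fit(X, train_scores)
lo = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, train_scores)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, train_scores)

X_new = featurize(["new question"], ["new long-form answer"])
y_hat = point.predict(X_new)                       # point estimate of the metric
interval = (lo.predict(X_new), hi.predict(X_new))  # ~90% prediction interval
print(y_hat, interval)
```

With so few examples, a raw quantile pair like this can under-cover; conformal calibration on held-out instances is a standard refinement.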
Related papers
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities [22.14002750185524]
We estimate capability boundaries: high conditional quantiles of benchmark scores as a function of log pre-training FLOPs. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases. We introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget.
arXiv Detail & Related papers (2026-02-17T03:13:51Z)
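As a rough illustration of the capability-boundary idea above (not the authors' estimator), a high conditional quantile of benchmark score versus log pre-training FLOPs can be fit with off-the-shelf quantile regression; the data below are synthetic.

```python
# Sketch: fit a 90th-percentile "capability boundary" of benchmark score
# as a function of log10 pre-training FLOPs. Synthetic data, linear model.
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
log_flops = rng.uniform(20.0, 26.0, size=200)          # log10(pre-training FLOPs)
score = np.clip(15 * (log_flops - 20) + rng.normal(0, 8, 200), 0, 100)

boundary = QuantileRegressor(quantile=0.9, alpha=0.0)  # 90th conditional quantile
boundary.fit(log_flops.reshape(-1, 1), score)
print(boundary.predict(np.array([[25.0]])))            # boundary at 1e25 FLOPs
```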
- Accuracy Law for the Future of Deep Time Series Forecasting [65.46625911002202]
Time series forecasting inherently faces a non-zero error lower bound due to its partially observable and uncertain nature. This paper focuses on a fundamental question: how to estimate the performance upper bound of deep time series forecasting. Based on rigorous statistical tests of over 2,800 newly trained deep forecasters, we discover a significant exponential relationship between the minimum forecasting error of deep models and the complexity of window-wise series patterns.
arXiv Detail & Related papers (2025-10-03T05:18:47Z)
- fev-bench: A Realistic Benchmark for Time Series Forecasting [19.931138737002215]
Existing benchmarks often have narrow domain coverage or overlook important real-world settings. We propose fev-bench, a benchmark comprising 100 forecasting tasks across seven domains. fev-bench employs principled aggregation methods with bootstrapped confidence intervals to report model performance.
arXiv Detail & Related papers (2025-09-30T16:17:18Z)
- TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness [23.143208640116253]
TimeRecipe is a framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2025-06-06T19:11:48Z)
- Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks [0.0]
We show how statistical methodology can be used to quantify uncertainty in metrics that have been aggregated across multiple tasks. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite overall poor performance.
arXiv Detail & Related papers (2025-01-08T02:17:34Z)
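One standard instance of the statistical methodology described above is a nonparametric bootstrap over tasks. The sketch below computes a percentile-bootstrap 95% confidence interval for a cross-task mean; the per-task scores are made up.

```python
# Sketch: percentile-bootstrap CI for a metric aggregated across tasks.
import numpy as np

rng = np.random.default_rng(0)
task_scores = np.array([0.71, 0.64, 0.82, 0.55, 0.90, 0.47, 0.78, 0.69])

boot_means = np.array([
    rng.choice(task_scores, size=task_scores.size, replace=True).mean()
    for _ in range(10_000)           # resample tasks, recompute the aggregate
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={task_scores.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```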
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. We conduct extensive experiments on representative multilingual model series to compare performance across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation [90.53485251837235]
Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training.
GIFT-Eval is a pioneering benchmark aimed at promoting evaluation across diverse datasets.
GIFT-Eval encompasses 23 datasets with over 144,000 time series and 177 million data points.
arXiv Detail & Related papers (2024-10-14T11:29:38Z)
- HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? [1.3654846342364308]
We introduce HoTPP, the first benchmark specifically designed to rigorously evaluate long-horizon predictions. We identify shortcomings in widely used evaluation metrics, propose a theoretically grounded T-mAP metric, and offer efficient implementations of popular models. We analyze the impact of autoregression and intensity-based losses on prediction quality, and outline promising directions for future research.
arXiv Detail & Related papers (2024-06-20T14:09:00Z)
- PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting [11.670324826998968]
In existing time series forecasting methods, the models take a sequence of numerical values as input and yield numerical values as output.
Inspired by the successes of pre-trained language foundation models, we propose a new forecasting paradigm: prompt-based time series forecasting.
In this novel task, the numerical input and output are transformed into prompts and the forecasting task is framed in a sentence-to-sentence manner.
arXiv Detail & Related papers (2022-09-20T10:15:35Z)
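The sentence-to-sentence framing of PromptCast can be illustrated by verbalizing a numeric window into a prompt and parsing the textual reply back into a number; the template wording and parsing below are assumptions, not the paper's exact prompts.

```python
# Sketch: turn a numeric series into a textual forecasting prompt and
# parse a textual answer back into a number. Template is illustrative.
import re

def make_prompt(history, unit="degrees"):
    values = ", ".join(str(v) for v in history)
    return (f"The temperature on each of the last {len(history)} days was "
            f"{values} {unit}. What will the temperature be tomorrow?")

def parse_forecast(reply):
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None

prompt = make_prompt([21, 23, 22, 24, 25])
# reply = language_model(prompt)        # any seq2seq / chat LLM (hypothetical)
print(parse_forecast("It will be around 26 degrees."))  # -> 26.0
```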
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a single sequence-generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by large margins in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
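One plausible instantiation of learning an NDV estimator (not necessarily the paper's exact design) is to featurize a random sample by its frequency-of-frequencies profile and regress to the true distinct count, as sketched below on synthetic columns.

```python
# Sketch: supervised NDV estimation from a random sample. Features are the
# sample's frequency-of-frequencies profile; the regressor choice is arbitrary.
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestRegressor

def profile(sample, max_freq=10):
    # f[i] = number of values appearing exactly i+1 times in the sample.
    counts = Counter(Counter(sample).values())
    f = [counts.get(i, 0) for i in range(1, max_freq + 1)]
    return np.array(f + [len(sample)], dtype=float)

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(500):                             # synthetic training columns
    ndv = rng.integers(10, 1000)
    column = rng.integers(0, ndv, size=2000)     # column with up to ndv values
    sample = rng.choice(column, size=200)        # random sample of the column
    X.append(profile(sample))
    y.append(len(np.unique(column)))             # true NDV label

model = RandomForestRegressor().fit(np.array(X), np.array(y))
```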
- Few-shot Learning for Time-series Forecasting [40.58524521473793]
We propose a few-shot learning method that forecasts a future value of a time series in a target task, given only a few time series from that task.
Our model is trained on time-series data from multiple training tasks that differ from the target tasks.
arXiv Detail & Related papers (2020-09-30T01:32:22Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence of each query sample to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
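The transductive update described above has a compact generic form: refine each class prototype with a confidence-weighted mean of query embeddings. The sketch below uses softmax-over-distance confidences and a fixed blend as stand-ins for the paper's meta-learned weights.

```python
# Sketch: confidence-weighted transductive prototype refinement.
# Confidence here is a softmax over negative distances; the paper instead
# meta-learns these weights.
import numpy as np

def refine_prototypes(prototypes, queries, temperature=1.0):
    # prototypes: (C, D) class prototypes from support means; queries: (Q, D).
    d = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (Q, C)
    conf = np.exp(-d / temperature)
    conf /= conf.sum(axis=1, keepdims=True)        # per-query class confidence
    # Confidence-weighted mean of queries, blended with the old prototype.
    weighted = conf.T @ queries / conf.sum(axis=0)[:, None]            # (C, D)
    return 0.5 * prototypes + 0.5 * weighted

protos = np.zeros((5, 64)); protos[np.arange(5), np.arange(5)] = 1.0   # toy 5-way
queries = np.random.randn(20, 64) * 0.1 + protos[np.random.randint(0, 5, 20)]
print(refine_prototypes(protos, queries).shape)    # (5, 64)
```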