Instance-level Performance Prediction for Long-form Generation Tasks
- URL: http://arxiv.org/abs/2509.07309v1
- Date: Tue, 09 Sep 2025 00:59:34 GMT
- Title: Instance-level Performance Prediction for Long-form Generation Tasks
- Authors: Chi-Yang Hsu, Alexander Braylan, Yiheng Su, Omar Alonso, Matthew Lease
- Abstract summary: We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks with fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs.
- Score: 47.21442052294225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
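The task formulation above is simple to prototype: embed the black-box input/output pair, fit a point regressor for the metric score, and fit quantile regressors for the interval. The encoder, models, and toy data below are illustrative assumptions, not the benchmark's actual baselines.

```python
# Minimal sketch of instance-level performance prediction: given only
# black-box (input, output) text, predict a continuous metric score plus a
# prediction interval. All modeling choices here are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import GradientBoostingRegressor

# Toy stand-ins for the 16 labeled training instances.
train_inputs = [f"question {i}" for i in range(16)]
train_outputs = [f"long-form answer {i}" for i in range(16)]
train_scores = np.random.rand(16)  # fine-grained metric scores in [0, 1]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed feature extractor

def featurize(inputs, outputs):
    # Concatenate embeddings of the task input and the long-form output.
    return np.hstack([encoder.encode(inputs), encoder.encode(outputs)])

X = featurize(train_inputs, train_outputs)
point = GradientBoostingRegressor(loss="squared_error").fit(X, train_scores)
lo = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, train_scores)
hi = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, train_scores)

X_new = featurize(["new question"], ["new long-form answer"])
y_hat = point.predict(X_new)                       # point estimate of the metric
interval = (lo.predict(X_new), hi.predict(X_new))  # ~90% prediction interval
print(y_hat, interval)
```

With so few examples, a raw quantile pair like this can under-cover; conformal calibration on held-out instances is a standard refinement.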
Related papers
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities [22.14002750185524]
We estimate capability boundaries: high conditional quantiles of benchmark scores as a function of log pre-training FLOPs. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases. We introduce an efficient algorithm that recovers near-full-data frontiers using roughly 20% of the evaluation budget.
arXiv Detail & Related papers (2026-02-17T03:13:51Z)
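As a rough illustration of the capability-boundary idea above (not the authors' estimator), a high conditional quantile of benchmark score versus log pre-training FLOPs can be fit with off-the-shelf quantile regression; the data below are synthetic.

```python
# Sketch: fit a 90th-percentile "capability boundary" of benchmark score
# as a function of log10 pre-training FLOPs. Synthetic data, linear model.
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(0)
log_flops = rng.uniform(20.0, 26.0, size=200)          # log10(pre-training FLOPs)
score = np.clip(15 * (log_flops - 20) + rng.normal(0, 8, 200), 0, 100)

boundary = QuantileRegressor(quantile=0.9, alpha=0.0)  # 90th conditional quantile
boundary.fit(log_flops.reshape(-1, 1), score)
print(boundary.predict(np.array([[25.0]])))            # boundary at 1e25 FLOPs
```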
- Accuracy Law for the Future of Deep Time Series Forecasting [65.46625911002202]
Time series forecasting inherently faces a non-zero error lower bound due to its partially observable and uncertain nature. This paper focuses on a fundamental question: how to estimate the performance upper bound of deep time series forecasting. Based on rigorous statistical tests of over 2,800 newly trained deep forecasters, we discover a significant exponential relationship between the minimum forecasting error of deep models and the complexity of window-wise series patterns.
arXiv Detail & Related papers (2025-10-03T05:18:47Z)
- fev-bench: A Realistic Benchmark for Time Series Forecasting [19.931138737002215]
Existing benchmarks often have narrow domain coverage or overlook important real-world settings. We propose fev-bench, a benchmark comprising 100 forecasting tasks across seven domains. fev-bench employs principled aggregation methods with bootstrapped confidence intervals to report model performance.
arXiv Detail & Related papers (2025-09-30T16:17:18Z)
- TimeRecipe: A Time-Series Forecasting Recipe via Benchmarking Module Level Effectiveness [23.143208640116253]
TimeRecipe is a framework that systematically evaluates time-series forecasting methods at the module level. TimeRecipe conducts over 10,000 experiments to assess the effectiveness of individual components. Our results reveal that exhaustive exploration of the design space can yield models that outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2025-06-06T19:11:48Z)
- Statistical Uncertainty Quantification for Aggregate Performance Metrics in Machine Learning Benchmarks [0.0]
We show how statistical methodology can be used to quantify uncertainty in metrics that have been aggregated across multiple tasks. These techniques reveal insights such as the dominance of a specific model for certain types of tasks despite overall poor performance.
arXiv Detail & Related papers (2025-01-08T02:17:34Z)
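One standard instance of the statistical methodology described above is a nonparametric bootstrap over tasks. The sketch below computes a percentile-bootstrap 95% confidence interval for a cross-task mean; the per-task scores are made up.

```python
# Sketch: percentile-bootstrap CI for a metric aggregated across tasks.
import numpy as np

rng = np.random.default_rng(0)
task_scores = np.array([0.71, 0.64, 0.82, 0.55, 0.90, 0.47, 0.78, 0.69])

boot_means = np.array([
    rng.choice(task_scores, size=task_scores.size, replace=True).mean()
    for _ in range(10_000)           # resample tasks, recompute the aggregate
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={task_scores.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```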
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. We conduct extensive experiments on representative multilingual model series to compare performance across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation [90.53485251837235]
Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training.
GIFT-Eval is a pioneering benchmark aimed at promoting evaluation across diverse datasets.
GIFT-Eval encompasses 23 datasets with over 144,000 time series and 177 million data points.
arXiv Detail & Related papers (2024-10-14T11:29:38Z)
- HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting? [1.3654846342364308]
We introduce HoTPP, the first benchmark specifically designed to rigorously evaluate long-horizon predictions. We identify shortcomings in widely used evaluation metrics, propose a theoretically grounded T-mAP metric, and offer efficient implementations of popular models. We analyze the impact of autoregression and intensity-based losses on prediction quality, and outline promising directions for future research.
arXiv Detail & Related papers (2024-06-20T14:09:00Z)
- PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting [11.670324826998968]
In existing time series forecasting methods, the models take a sequence of numerical values as input and yield numerical values as output.
Inspired by the successes of pre-trained language foundation models, we propose a new forecasting paradigm: prompt-based time series forecasting.
In this novel task, the numerical input and output are transformed into prompts and the forecasting task is framed in a sentence-to-sentence manner.
arXiv Detail & Related papers (2022-09-20T10:15:35Z)
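The sentence-to-sentence framing of PromptCast can be illustrated by verbalizing a numeric window into a prompt and parsing the textual reply back into a number; the template wording and parsing below are assumptions, not the paper's exact prompts.

```python
# Sketch: turn a numeric series into a textual forecasting prompt and
# parse a textual answer back into a number. Template is illustrative.
import re

def make_prompt(history, unit="degrees"):
    values = ", ".join(str(v) for v in history)
    return (f"The temperature on each of the last {len(history)} days was "
            f"{values} {unit}. What will the temperature be tomorrow?")

def parse_forecast(reply):
    match = re.search(r"-?\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None

prompt = make_prompt([21, 23, 22, 24, 25])
# reply = language_model(prompt)        # any seq2seq / chat LLM (hypothetical)
print(parse_forecast("It will be around 26 degrees."))  # -> 26.0
```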
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks as a single sequence-generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by large margins in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
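One plausible instantiation of learning an NDV estimator (not necessarily the paper's exact design) is to featurize a random sample by its frequency-of-frequencies profile and regress to the true distinct count, as sketched below on synthetic columns.

```python
# Sketch: supervised NDV estimation from a random sample. Features are the
# sample's frequency-of-frequencies profile; the regressor choice is arbitrary.
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestRegressor

def profile(sample, max_freq=10):
    # f[i] = number of values appearing exactly i+1 times in the sample.
    counts = Counter(Counter(sample).values())
    f = [counts.get(i, 0) for i in range(1, max_freq + 1)]
    return np.array(f + [len(sample)], dtype=float)

rng = np.random.default_rng(0)
X, y = [], []
for _ in range(500):                             # synthetic training columns
    ndv = rng.integers(10, 1000)
    column = rng.integers(0, ndv, size=2000)     # column with up to ndv values
    sample = rng.choice(column, size=200)        # random sample of the column
    X.append(profile(sample))
    y.append(len(np.unique(column)))             # true NDV label

model = RandomForestRegressor().fit(np.array(X), np.array(y))
```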
- Few-shot Learning for Time-series Forecasting [40.58524521473793]
We propose a few-shot learning method that forecasts a future value of a time series in a target task, given only a few time series from that task.
Our model is trained on time-series data from multiple training tasks that differ from the target tasks.
arXiv Detail & Related papers (2020-09-30T01:32:22Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence of each query sample to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
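The transductive update described above has a compact generic form: refine each class prototype with a confidence-weighted mean of query embeddings. The sketch below uses softmax-over-distance confidences and a fixed blend as stand-ins for the paper's meta-learned weights.

```python
# Sketch: confidence-weighted transductive prototype refinement.
# Confidence here is a softmax over negative distances; the paper instead
# meta-learns these weights.
import numpy as np

def refine_prototypes(prototypes, queries, temperature=1.0):
    # prototypes: (C, D) class prototypes from support means; queries: (Q, D).
    d = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (Q, C)
    conf = np.exp(-d / temperature)
    conf /= conf.sum(axis=1, keepdims=True)        # per-query class confidence
    # Confidence-weighted mean of queries, blended with the old prototype.
    weighted = conf.T @ queries / conf.sum(axis=0)[:, None]            # (C, D)
    return 0.5 * prototypes + 0.5 * weighted

protos = np.zeros((5, 64)); protos[np.arange(5), np.arange(5)] = 1.0   # toy 5-way
queries = np.random.randn(20, 64) * 0.1 + protos[np.random.randint(0, 5, 20)]
print(refine_prototypes(protos, queries).shape)    # (5, 64)
```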