Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks
- URL: http://arxiv.org/abs/2507.09871v2
- Date: Wed, 23 Jul 2025 20:53:29 GMT
- Title: Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks
- Authors: Niket Patel, Randall Balestriero
- Abstract summary: We argue that such a rigid evaluation protocol creates a silent bottleneck in AI research. Under this view, one can evaluate a model's performance over the set of all possible downstream tasks.
- Score: 13.412573082645096
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The grand goal of AI research, and particularly Self-Supervised Learning (SSL), is to produce systems that can successfully solve any possible task. In contrast, the evaluation methods currently available to AI researchers typically rely on a fixed collection of hand-picked downstream benchmarks, so a large amount of effort goes into designing and searching for large collections of evaluation tasks that can serve as a proxy for this grand goal. We argue that such a rigid evaluation protocol creates a silent bottleneck in AI research. To remedy this, we define a probabilistic space of downstream tasks, obtained by adopting a distribution over tasks and defining Task Priors. Under this view, one can evaluate a model's performance over the set of all possible downstream tasks. Our framework is the first to answer key questions such as (i) what is the average performance of my model over all possible downstream tasks, weighted by the probability of encountering each task? and (ii) what is the variance of my model's performance across all downstream tasks under the defined Task Priors? Beyond establishing a new standard for evaluation, we believe that Task Priors will accelerate the pace of research in SSL, where downstream task evaluation is the sole quantitative signal that researchers have access to.
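The two questions above amount to an expectation and a variance over a task distribution, both of which can be estimated by Monte Carlo: sample tasks from the prior, evaluate a frozen representation on each, and aggregate. The sketch below is a minimal illustration under assumed ingredients, not the paper's actual construction: the Task Prior is modeled as random noisy linear labelings of the embedding space, and `sample_task` / `probe_performance` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task(Z, rng, noise=0.1):
    """Draw one binary task from an assumed Task Prior: a random
    hyperplane in embedding space labels the data, with a fraction
    of labels flipped so the task is not trivially separable."""
    w = rng.standard_normal(Z.shape[1])
    y = (Z @ w > 0).astype(int)
    flips = rng.random(len(y)) < noise
    return np.where(flips, 1 - y, y)

def probe_performance(Z, y, reg=1e-3):
    """Closed-form ridge-regularized linear probe; its accuracy
    stands in for the model's performance on this task."""
    t = 2.0 * y - 1.0  # map {0,1} labels to {-1,+1} regression targets
    w = np.linalg.solve(Z.T @ Z + reg * np.eye(Z.shape[1]), Z.T @ t)
    return float(((Z @ w > 0).astype(int) == y).mean())

# Toy "frozen representation": 1000 samples, 32-dim embeddings.
Z = rng.standard_normal((1000, 32))

# Monte Carlo estimates of the two quantities the abstract asks about:
# expected performance under the Task Prior, and its variance.
perf = np.array([probe_performance(Z, sample_task(Z, rng)) for _ in range(200)])
print(f"E[perf] ~= {perf.mean():.3f}   Var[perf] ~= {perf.var():.5f}")
```

The key design point is that the representation stays fixed while only the task is resampled, so the printed mean and variance characterize the model itself rather than any single hand-picked benchmark.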
Related papers
- Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models [71.34520793462069]
Unsupervised reinforcement learning (RL) aims at pre-training agents that can solve a wide range of downstream tasks in complex environments.
We introduce a novel algorithm regularizing unsupervised RL towards imitating trajectories from unlabeled behavior datasets.
We demonstrate the effectiveness of this new approach in a challenging humanoid control problem.
arXiv Detail & Related papers (2025-04-15T10:41:11Z) - Model Predictive Task Sampling for Efficient and Robust Adaptation [46.92143725900031]
We introduce Model Predictive Task Sampling (MPTS), a framework that bridges the task space and adaptation risk landscape.
MPTS employs a generative model to characterize the episodic optimization process and predicts task-specific adaptation risk via posterior inference.
MPTS seamlessly integrates into zero-shot, few-shot, and supervised finetuning settings.
arXiv Detail & Related papers (2025-01-19T13:14:53Z) - Class Incremental Learning via Likelihood Ratio Based Task Prediction [20.145128455767587]
An emerging theory-guided approach trains a task-specific model for each task within a single network shared across all tasks.
This paper argues that using a traditional OOD detector for task-id prediction is sub-optimal because additional information can be exploited.
We call the new method TPL (Task-id Prediction based on Likelihood Ratio); a generic sketch of the likelihood-ratio idea appears after this list.
It markedly outperforms strong CIL baselines and has negligible catastrophic forgetting.
arXiv Detail & Related papers (2023-09-26T16:25:57Z) - CrossCodeBench: Benchmarking Cross-Task Generalization of Source Code Models [33.78307982736911]
Cross-task generalization has strong research and application value.
We propose a large-scale benchmark that includes 216 existing code-related tasks.
arXiv Detail & Related papers (2023-02-08T13:04:52Z) - Reinforcement Learning with Success Induced Task Prioritization [68.8204255655161]
We introduce Success Induced Task Prioritization (SITP), a framework for automatic curriculum learning.
The algorithm selects the ordering of tasks that provides the fastest learning for agents.
We demonstrate that SITP matches or surpasses the results of other curriculum design methods.
arXiv Detail & Related papers (2022-12-30T12:32:43Z) - Provable Benefits of Representational Transfer in Reinforcement Learning [59.712501044999875]
We study the problem of representational transfer in RL, where an agent first pretrains in a number of source tasks to discover a shared representation.
We show that given generative access to source tasks, we can discover a representation, using which subsequent linear RL techniques quickly converge to a near-optimal policy.
arXiv Detail & Related papers (2022-05-29T04:31:29Z) - SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z) - Evaluating NLP Systems On a Novel Cloze Task: Judging the Plausibility of Possible Fillers in Instructional Texts [2.3449131636069898]
The cloze task is widely used to evaluate an NLP system's language understanding ability.
A new task is proposed: predicting whether a filler word in a cloze task is a good, neutral, or bad candidate.
arXiv Detail & Related papers (2021-12-03T12:02:52Z) - Exploring and Predicting Transferability across NLP Tasks [115.6278033699853]
We study the transferability between 33 NLP tasks across three broad classes of problems.
Our results show that transfer learning is more beneficial than previously thought.
We also develop task embeddings that can be used to predict the most transferable source tasks for a given target task.
arXiv Detail & Related papers (2020-05-02T09:39:36Z) - Hierarchical Reinforcement Learning as a Model of Human Task Interleaving [60.95424607008241]
We develop a hierarchical model of supervisory control driven by reinforcement learning.
The model reproduces known empirical effects of task interleaving.
The results support hierarchical RL as a plausible model of task interleaving.
arXiv Detail & Related papers (2020-01-04T17:53:28Z)
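For the likelihood-ratio idea referenced in the TPL entry above, a minimal sketch follows. It assumes per-task diagonal-Gaussian feature densities and a hypothetical `predict_task_id` helper that scores each candidate task by its log-likelihood ratio against the strongest alternative; this is a generic illustration of the idea, not the paper's actual method.

```python
import numpy as np

def fit_diag_gaussian(X):
    """Fit a diagonal Gaussian to one task's feature vectors."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_density(x, mu, var):
    """Log-density of x under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict_task_id(x, params):
    """Score each task by the log-likelihood ratio between its own
    density and the strongest competing task, then take the argmax."""
    lls = np.array([log_density(x, mu, var) for mu, var in params])
    ratios = [lls[k] - np.max(np.delete(lls, k)) for k in range(len(lls))]
    return int(np.argmax(ratios))

rng = np.random.default_rng(1)
# Toy features: three tasks whose feature clouds have different means.
task_feats = [rng.standard_normal((500, 8)) + 3.0 * k for k in range(3)]
params = [fit_diag_gaussian(X) for X in task_feats]

x = task_feats[2][0]               # a sample that came from task 2
print(predict_task_id(x, params))  # -> 2
```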