Task Generalization With AutoRegressive Compositional Structure: Can Learning From $d$ Tasks Generalize to $d^{T}$ Tasks?
- URL: http://arxiv.org/abs/2502.08991v1
- Date: Thu, 13 Feb 2025 06:08:01 GMT
- Title: Task Generalization With AutoRegressive Compositional Structure: Can Learning From $d$ Tasks Generalize to $d^{T}$ Tasks?
- Authors: Amirhesam Abedsoltan, Huaqing Zhang, Kaiyue Wen, Hongzhou Lin, Jingzhao Zhang, Mikhail Belkin
- Abstract summary: Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations.
This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family?
In this paper, we investigate task generalization through the lens of AutoRegressive Compositional (ARC) structure, where each task is a composition of $T$ operations, and each operation is among a finite family of $d$ subtasks.
- Score: 23.597170816867077
- License:
- Abstract: Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: When can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of AutoRegressive Compositional (ARC) structure, where each task is a composition of $T$ operations, and each operation is among a finite family of $d$ subtasks. This yields a total class of size \( d^{T} \). We first show that generalization to all \( d^{T} \) tasks is theoretically achievable by training on only \( \tilde{O}(d) \) tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via in-context learning (ICL) and Chain-of-Thought (CoT) reasoning. We further demonstrate this generalization in arithmetic and language translation, extending beyond parity functions.
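To make the combinatorics concrete, here is a minimal Python sketch (not the authors' code; the variable names and the chain-of-thought trace format are my own) of the ARC view of sparse parity: each task composes $T$ elementary operations drawn from $d$ subtasks, so the task class has $d^{T}$ members, and a CoT-style trace exposes the running state after every step.

```python
# A minimal sketch (not the authors' code) of the AutoRegressive Compositional
# (ARC) view of sparse parity: a task is a sequence of T operations, each drawn
# from a family of d subtasks, so the task class has d**T members.
import itertools
import random

d, T = 5, 3  # d candidate subtasks per step, T composition steps

# For sparse parity over d input bits, step t's "subtask" is the choice of
# which coordinate to XOR into the running parity.
all_tasks = list(itertools.product(range(d), repeat=T))
assert len(all_tasks) == d ** T  # exponentially many tasks from d choices per step

def cot_trace(task, x):
    """Chain-of-Thought-style trace: emit the running parity after each step."""
    state, trace = 0, []
    for coord in task:           # autoregressive composition of subtasks
        state ^= x[coord]        # apply one of the d elementary operations
        trace.append(state)      # intermediate step exposed to the model
    return trace                 # trace[-1] is the task's label for input x

x = [random.randint(0, 1) for _ in range(d)]
print(cot_trace(all_tasks[0], x))
```

The paper's theoretical claim is that seeing only on the order of $\tilde{O}(d)$ of these $d^{T}$ tasks can suffice for generalization to the whole family; the sketch only illustrates the combinatorial structure, not the learning result.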
Related papers
- Beyond Task Diversity: Provable Representation Transfer for Sequential Multi-Task Linear Bandits [17.970177214029473]
We study lifelong learning in linear bandits, where a learner interacts with a sequence of linear bandit tasks.
Current literature assumes that the tasks are diverse, i.e., their parameters uniformly span the $m$-dimensional subspace.
We present the first nontrivial result for sequential multi-task linear bandits without the task diversity assumption.
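A minimal sketch of the shared-subspace structure behind this setting (my notation; the dimensions, reward model, and noise level are illustrative only): every task's parameter vector lies in a common $m$-dimensional subspace of the feature space.

```python
# A minimal sketch (my notation, not the paper's) of the shared-subspace
# structure in sequential linear bandit tasks: every task parameter lies in a
# common m-dimensional subspace of the ambient feature space.
import numpy as np

rng = np.random.default_rng(0)
ambient_dim, m, num_tasks = 20, 3, 8

B = np.linalg.qr(rng.normal(size=(ambient_dim, m)))[0]   # shared representation (orthonormal basis)
task_params = [B @ rng.normal(size=m) for _ in range(num_tasks)]  # per-task low-dim coefficients

def pull(theta, arm_features, noise=0.1):
    """Noisy linear reward for one arm of one task."""
    return arm_features @ theta + noise * rng.normal()

arm = rng.normal(size=ambient_dim)
print([round(pull(theta, arm), 3) for theta in task_params])
```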
arXiv Detail & Related papers (2025-01-23T05:21:27Z) - Metalearning with Very Few Samples Per Task [19.78398372660794]
We consider a binary classification setting where tasks are related by a shared representation.
Here, the amount of data is measured in terms of the number of tasks $t$ that we need to see and the number of samples $n$ per task.
Our work also yields a characterization of distribution-free multitask learning and reductions between meta and multitask learning.
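A toy sketch of this data regime (hypothetical linear representation and heads, not the paper's construction): many tasks, each observed through only a handful of labeled samples, all sharing one representation.

```python
# A minimal sketch (hypothetical setup, not the paper's construction) of the
# metalearning data model: t binary classification tasks that share one
# representation h, observed through only n samples per task.
import numpy as np

rng = np.random.default_rng(1)
dim, rep_dim, t, n = 10, 2, 50, 4   # many tasks, very few samples per task

H = rng.normal(size=(rep_dim, dim))            # shared representation (linear, for illustration)
task_heads = rng.normal(size=(t, rep_dim))     # task-specific linear classifiers

def sample_task(j):
    """Draw the n labeled samples available for task j."""
    X = rng.normal(size=(n, dim))
    y = np.sign(X @ H.T @ task_heads[j])       # label = sign of head_j(h(x))
    return X, y

X0, y0 = sample_task(0)
print(X0.shape, y0)
```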
arXiv Detail & Related papers (2023-12-21T16:06:44Z) - What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer.
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
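The following numpy sketch (my own emulation, not the RASP library) conveys the flavor of RASP-style select/aggregate primitives, using sequence reversal as the example program; because the program is written independently of the input length, it length-generalizes by construction.

```python
# A minimal numpy sketch (mine, not the RASP library) of RASP-style
# select/aggregate primitives, used to express sequence reversal -- the kind of
# length-independent program the paper links to length generalization.
import numpy as np

def select(keys, queries, predicate):
    """Boolean attention pattern: entry (i, j) is predicate(keys[j], queries[i])."""
    return np.array([[predicate(k, q) for k in keys] for q in queries])

def aggregate(pattern, values):
    """Average the selected values per query position (uniform attention)."""
    weights = pattern / np.maximum(pattern.sum(axis=1, keepdims=True), 1)
    return weights @ values

def reverse(tokens):
    n = len(tokens)
    idx = np.arange(n)
    pattern = select(idx, idx, lambda k, q: k == n - 1 - q)  # attend to mirrored position
    return aggregate(pattern, np.array(tokens, dtype=float))

# The same program runs at any length, so it length-generalizes by construction.
print(reverse([1, 2, 3]), reverse([5, 6, 7, 8, 9]))
```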
arXiv Detail & Related papers (2023-10-24T17:43:29Z) - In-Context Learning Creates Task Vectors [40.990432572831885]
In-context learning (ICL) in Large Language Models (LLMs) has emerged as a powerful new learning paradigm.
Here we show that the functions learned by ICL often have a very simple structure.
We support the above claim via comprehensive experiments across a range of models and tasks.
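A toy, runnable analogy for the claimed structure (entirely my own; real task vectors are hidden states of an LLM, not mean shifts): the demonstrations are first compressed into a single vector theta, and queries are then answered using theta alone.

```python
# A toy sketch (mine, not the paper's experiments) of the task-vector claim:
# ICL behaves like first compressing the demonstrations S into a single vector
# theta(S), then answering queries with a fixed function f(x; theta).
import numpy as np

def compress_to_task_vector(demo_inputs, demo_outputs):
    """Toy hypothesis class: tasks that add a constant vector. theta = mean shift."""
    return np.mean(np.asarray(demo_outputs) - np.asarray(demo_inputs), axis=0)

def apply_task_vector(theta, query):
    """Query-time rule uses only theta, not the raw demonstrations."""
    return np.asarray(query) + theta

demos_in = [np.array([1.0, 2.0]), np.array([0.0, 5.0])]
demos_out = [x + np.array([3.0, -1.0]) for x in demos_in]   # unknown task: shift by (3, -1)

theta = compress_to_task_vector(demos_in, demos_out)
print(apply_task_vector(theta, np.array([10.0, 10.0])))      # -> [13.  9.]
```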
arXiv Detail & Related papers (2023-10-24T15:17:14Z) - On the Sample Complexity of Representation Learning in Multi-task Bandits with Global and Local structure [77.60508571062958]
We investigate the sample complexity of learning the optimal arm for multi-task bandit problems.
Arms consist of two components: one that is shared across tasks (that we call representation) and one that is task-specific (that we call predictor).
We devise an algorithm OSRL-SC whose sample complexity approaches the lower bound, and scales at most as $H(G\log(\delta_G) + X\log(\delta_H))$, with $X, G, H$ being, respectively, the number of tasks, representations and predictors.
arXiv Detail & Related papers (2022-11-28T08:40:12Z) - Decomposed Prompting: A Modular Approach for Solving Complex Tasks [55.42850359286304]
We propose Decomposed Prompting to solve complex tasks by decomposing them (via prompting) into simpler sub-tasks.
This modular structure allows each prompt to be optimized for its specific sub-task.
We show that the flexibility and modularity of Decomposed Prompting allows it to outperform prior work on few-shot prompting.
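A minimal sketch of the control flow (toy sub-task handlers of my own; in the actual method each handler can be an LLM called with its own few-shot prompt): a complex query is solved by routing it through dedicated sub-task modules.

```python
# A minimal sketch (my toy handlers, not the paper's prompts) of Decomposed
# Prompting: a controller breaks a complex query into simpler sub-tasks, each
# handled by its own dedicated module (a sub-task prompt or a symbolic function).
def split_words(text: str):          # sub-task 1: split the input into words
    return text.split()

def first_letters(words):            # sub-task 2: take the first letter of each word
    return [w[0] for w in words]

def concatenate(letters):            # sub-task 3: join the pieces
    return "".join(letters)

# In the real method each handler could be an LLM called with a sub-task-specific
# prompt; here symbolic functions stand in so the sketch runs on its own.
SUBTASK_HANDLERS = [split_words, first_letters, concatenate]

def decomposed_solve(query: str):
    state = query
    for handler in SUBTASK_HANDLERS:   # the decomposer's plan, executed step by step
        state = handler(state)
    return state

print(decomposed_solve("take first letters"))   # -> "tfl"
```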
arXiv Detail & Related papers (2022-10-05T17:28:20Z) - Improving Task Generalization via Unified Schema Prompt [87.31158568180514]
Unified Schema Prompt is a flexible and extensible prompting method that automatically customizes the learnable prompts for each task according to the task input schema.
It models the knowledge shared between tasks while keeping the characteristics of different task schemas.
The framework achieves strong zero-shot and few-shot performance on 16 unseen tasks downstream from 8 task types.
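A simplified sketch of schema-driven prompt composition (my own reduction; the component names and prompt shapes are illustrative): each schema component owns a shared learnable prompt, and a task's prompt is assembled from exactly the components its input schema declares.

```python
# A minimal sketch (my own simplification, not the paper's implementation) of
# schema-driven prompt composition: each schema component has a shared learnable
# prompt, and a task's prompt is assembled from the components its schema uses.
import numpy as np

rng = np.random.default_rng(0)
PROMPT_LEN, EMB = 4, 8

# Learnable soft prompts, shared across every task that uses the component.
component_prompts = {
    name: rng.normal(size=(PROMPT_LEN, EMB))
    for name in ["question", "context", "options", "premise", "hypothesis"]
}

def build_task_prompt(task_schema):
    """Concatenate the component prompts named by this task's input schema."""
    return np.concatenate([component_prompts[c] for c in task_schema], axis=0)

qa_prompt = build_task_prompt(["question", "context"])        # extractive QA
nli_prompt = build_task_prompt(["premise", "hypothesis"])     # NLI
print(qa_prompt.shape, nli_prompt.shape)                      # (8, 8) (8, 8)
```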
arXiv Detail & Related papers (2022-08-05T15:26:36Z) - Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
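A minimal sketch of the subtask-graph abstraction (hypothetical subtasks, not the MTSGI implementation): each subtask lists its preconditions, and execution proceeds over whichever subtasks are currently eligible.

```python
# A minimal sketch (hypothetical tasks, not the MTSGI code) of a subtask graph:
# each subtask lists precondition subtasks, and a subtask becomes eligible once
# all of its preconditions are completed.
SUBTASK_GRAPH = {            # subtask -> preconditions (a small DAG)
    "get_wood": [],
    "get_stone": [],
    "make_axe": ["get_wood", "get_stone"],
    "chop_tree": ["make_axe"],
}

def eligible(completed):
    """Subtasks whose preconditions are all satisfied but are not yet done."""
    return [s for s, pre in SUBTASK_GRAPH.items()
            if s not in completed and all(p in completed for p in pre)]

done = set()
while len(done) < len(SUBTASK_GRAPH):          # greedy execution in dependency order
    nxt = eligible(done)[0]
    done.add(nxt)
    print("executing:", nxt)
```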
arXiv Detail & Related papers (2022-05-25T10:44:25Z) - Distribution Matching for Heterogeneous Multi-Task Learning: a Large-scale Face Study [75.42182503265056]
Multi-Task Learning has emerged as a methodology in which multiple tasks are jointly learned by a shared learning algorithm.
We deal with heterogeneous MTL, simultaneously addressing detection, classification & regression problems.
We build FaceBehaviorNet, the first framework for large-scale face analysis, by jointly learning all facial behavior tasks.
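A toy numpy sketch of heterogeneous multi-task training (my own illustrative heads and losses, not FaceBehaviorNet): one shared feature extractor feeds a classification head and a regression head, and their losses are minimized jointly.

```python
# A minimal numpy sketch (my toy heads and losses, not FaceBehaviorNet) of
# heterogeneous multi-task learning: one shared feature extractor feeds separate
# heads whose classification and regression losses are summed into one objective.
import numpy as np

rng = np.random.default_rng(0)
FEAT = 16

W_shared = rng.normal(size=(32, FEAT))        # shared backbone (linear, for illustration)
W_cls = rng.normal(size=(FEAT, 7))            # e.g. basic-expression classification head
W_reg = rng.normal(size=(FEAT, 2))            # e.g. valence/arousal regression head

def forward(x):
    z = np.tanh(x @ W_shared)                 # shared representation
    return z @ W_cls, z @ W_reg

def total_loss(x, y_cls, y_reg):
    logits, pred = forward(x)
    log_probs = logits - np.log(np.exp(logits).sum())          # log-softmax
    return -log_probs[y_cls] + np.mean((pred - y_reg) ** 2)    # CE + MSE, jointly minimized

x = rng.normal(size=32)
print(total_loss(x, y_cls=3, y_reg=np.array([0.2, -0.1])))
```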
arXiv Detail & Related papers (2021-05-08T22:26:52Z) - On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t\,C(\mathcal{F})$.
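Read informally (my paraphrase, not the paper's exact statement), the benefit is that the representation cost is paid once rather than once per task:

```latex
% A rough back-of-the-envelope reading of the bound (my paraphrase, not the
% paper's exact statement): with diverse tasks the representation cost C(H)
% is amortized across tasks instead of being paid for each task separately.
\[
\underbrace{C(\mathcal{H}) + t\,C(\mathcal{F})}_{\text{shared representation}}
\;\ll\;
\underbrace{t\,\bigl(C(\mathcal{H}) + C(\mathcal{F})\bigr)}_{\text{each task learned in isolation}}
\quad\text{when } C(\mathcal{H}) \gg C(\mathcal{F}) \text{ and } t \text{ is large.}
\]
```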
arXiv Detail & Related papers (2020-06-20T20:33:59Z)