Related papers: Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks

Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks

URL: http://arxiv.org/abs/2307.06887v5
Date: Fri, 7 Jun 2024 02:12:08 GMT
Title: Provable Multi-Task Representation Learning by Two-Layer ReLU Neural Networks
Authors: Liam Collins, Hamed Hassani, Mahdi Soltanolkotabi, Aryan Mokhtari, Sanjay Shakkottai,
Abstract summary: We present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks.
Score: 69.38572074372392
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: An increasingly popular machine learning paradigm is to pretrain a neural network (NN) on many tasks offline, then adapt it to downstream tasks, often by re-training only the last linear layer of the network. This approach yields strong downstream performance in a variety of contexts, demonstrating that multitask pretraining leads to effective feature learning. Although several recent theoretical studies have shown that shallow NNs learn meaningful features when either (i) they are trained on a {\em single} task or (ii) they are {\em linear}, very little is known about the closer-to-practice case of {\em nonlinear} NNs trained on {\em multiple} tasks. In this work, we present the first results proving that feature learning occurs during training with a nonlinear model on multiple tasks. Our key insight is that multi-task pretraining induces a pseudo-contrastive loss that favors representations that align points that typically have the same label across tasks. Using this observation, we show that when the tasks are binary classification tasks with labels depending on the projection of the data onto an $r$-dimensional subspace within the $d\gg r$-dimensional input space, a simple gradient-based multitask learning algorithm on a two-layer ReLU NN recovers this projection, allowing for generalization to downstream tasks with sample and neuron complexity independent of $d$. In contrast, we show that with high probability over the draw of a single task, training on this single task cannot guarantee to learn all $r$ ground-truth features.

Related papers

Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners [60.75160178669076]
We show that the use of high-capacity value models trained via cross-entropy and conditioned on learnable task embeddings addresses the problem of task interference in online reinforcement learning.<n>We test our approach on 7 multi-task benchmarks with over 280 unique tasks, spanning high degree-of-freedom humanoid control and discrete vision-based RL.
arXiv Detail & Related papers (2025-05-29T06:41:45Z)
Negotiated Representations to Prevent Forgetting in Machine Learning Applications [0.0]
Catastrophic forgetting is a significant challenge in the field of machine learning. We propose a novel method for preventing catastrophic forgetting in machine learning applications.
arXiv Detail & Related papers (2023-11-30T22:43:50Z)
ULTRA-DP: Unifying Graph Pre-training with Multi-task Graph Dual Prompt [67.8934749027315]
We propose a unified framework for graph hybrid pre-training which injects the task identification and position identification into GNNs. We also propose a novel pre-training paradigm based on a group of $k$-nearest neighbors.
arXiv Detail & Related papers (2023-10-23T12:11:13Z)
Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations. We find that learned representations in a given layer exhibit a degree of diffuse redundancy. Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
arXiv Detail & Related papers (2023-05-31T21:00:50Z)
Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners [67.5865966762559]
We study whether sparsely activated Mixture-of-Experts (MoE) improve multi-task learning. We devise task-aware gating functions to route examples from different tasks to specialized experts. This results in a sparsely activated multi-task model with a large number of parameters, but with the same computational cost as that of a dense model.
arXiv Detail & Related papers (2022-04-16T00:56:12Z)
Representation Learning Beyond Linear Prediction Functions [33.94130046391917]
We show that diversity can be achieved when source tasks and the target task use different prediction function spaces beyond linear functions. For a general function class, we find that eluder dimension gives a lower bound on the number of tasks required for diversity.
arXiv Detail & Related papers (2021-05-31T14:21:52Z)
Beneficial Perturbation Network for designing general adaptive artificial intelligence systems [14.226973149346886]
We propose a new type of deep neural network with extra, out-of-network, task-dependent biasing units to accommodate dynamic situations. Our approach is memory-efficient and parameter-efficient, can accommodate many tasks, and achieves state-of-the-art performance across different tasks and domains.
arXiv Detail & Related papers (2020-09-27T01:28:10Z)
Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference [75.95287293847697]
Two common challenges in developing multi-task models are often overlooked in literature. First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting the previously learned ones (incremental learning) Second, eliminating adverse interactions amongst tasks, which has been shown to significantly degrade the single-task performance in a multi-task setup (task interference)
arXiv Detail & Related papers (2020-07-24T14:44:46Z)
Deep Multi-Task Augmented Feature Learning via Hierarchical Graph Neural Network [4.121467410954028]
We propose a Hierarchical Graph Neural Network to learn augmented features for deep multi-task learning. Experiments on real-world datastes show the significant performance improvement when using this strategy.
arXiv Detail & Related papers (2020-02-12T06:02:20Z)
Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks [95.51368472949308]
Adaptation can be useful in cases when training data is scarce, or when one wishes to encode priors in the network. In this paper, we propose a straightforward alternative: side-tuning.
arXiv Detail & Related papers (2019-12-31T18:52:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.