Discrete Key-Value Bottleneck
- URL: http://arxiv.org/abs/2207.11240v3
- Date: Mon, 12 Jun 2023 15:30:22 GMT
- Title: Discrete Key-Value Bottleneck
- Authors: Frederik Träuble, Anirudh Goyal, Nasim Rahaman, Michael Mozer, Kenji
  Kawaguchi, Yoshua Bengio, Bernhard Schölkopf
- Abstract summary: Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach to the challenges posed by non-stationary training data streams, such as continual learning, involves pre-training large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
- Score: 95.61236311369821
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks perform well on classification tasks where data streams
are i.i.d. and labeled data is abundant. Challenges emerge with non-stationary
training data streams such as continual learning. One powerful approach that
has addressed this challenge involves pre-training of large encoders on volumes
of readily available data, followed by task-specific tuning. Given a new task,
however, updating the weights of these encoders is challenging as a large
number of weights needs to be fine-tuned, and as a result, they forget
information about the previous tasks. In the present work, we propose a model
architecture to address this issue, building upon a discrete bottleneck
containing pairs of separate and learnable key-value codes. Our paradigm will
be to encode; process the representation via a discrete bottleneck; and decode.
Here, the input is fed to the pre-trained encoder, the output of the encoder is
used to select the nearest keys, and the corresponding values are fed to the
decoder to solve the current task. The model can only fetch and re-use a sparse
number of these key-value pairs during inference, enabling localized and
context-dependent model updates. We theoretically investigate the ability of
the discrete key-value bottleneck to minimize the effect of learning under
distribution shifts and show that it reduces the complexity of the hypothesis
class. We empirically verify the proposed method under challenging
class-incremental learning scenarios and show that the proposed model, without
any task boundaries, reduces catastrophic forgetting across a wide variety of
pre-trained models, outperforming relevant baselines on this task.
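
The encode, bottleneck, decode pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single codebook, the squared Euclidean distance, the fixed keys, the plain gradient step, and all function names (nearest_key_indices, bottleneck_forward, update_values) are assumptions made for illustration. The encoder output is replaced by a hand-built array, and the decoder is omitted.

```python
import numpy as np

def nearest_key_indices(z, keys):
    """Index of the closest key (squared Euclidean distance) for each row of z."""
    # z: (batch, d_key), keys: (num_pairs, d_key)
    d2 = ((z[:, None, :] - keys[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def bottleneck_forward(z, keys, values):
    """Fetch the value code paired with each input's nearest key."""
    idx = nearest_key_indices(z, keys)
    return values[idx], idx

def update_values(values, idx, grad_out, lr=0.1):
    """Gradient step on the fetched value codes only; all other rows are untouched."""
    new_values = values.copy()
    # ufunc.at accumulates correctly even if the same index is selected twice.
    np.subtract.at(new_values, idx, lr * grad_out)
    return new_values

# Toy setup (hypothetical sizes): 8 key-value pairs, 4-dim keys, 3-dim values.
keys = np.arange(32, dtype=float).reshape(8, 4)  # kept fixed in this sketch (assumption)
values = np.zeros((8, 3))                        # learnable value codes
z = keys[[2, 5]] + 0.1                           # stand-in for pre-trained encoder outputs

fetched, idx = bottleneck_forward(z, keys, values)   # fetched would go to the decoder
new_values = update_values(values, idx, grad_out=np.ones_like(fetched))
changed_rows = np.flatnonzero(np.any(new_values != values, axis=1))
print(idx, changed_rows)
```

Only the value rows selected by the two inputs change after the update, which is the sense in which updates are sparse and localized: inputs that map to other keys keep reading the same value codes as before.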
 
      
Related papers
- These Are Not All the Features You Are Looking For: A Fundamental Bottleneck in Supervised Pretraining [10.749875317643031]
 Transfer learning is a cornerstone of modern machine learning, promising a way to adapt models pretrained on a broad mix of data to new tasks with minimal new data.
We evaluate model transfer from a pretraining mixture to each of its component tasks, assessing whether pretrained features can match the performance of task-specific direct training.
We identify a fundamental limitation in deep learning models, where networks fail to learn new features once they encode similar competing features during training.
 arXiv (2025-06-23T01:04:29Z)
- Continual Learning for Encoder-only Language Models via a Discrete Key-Value Bottleneck [6.137272725645159]
 We introduce a discrete key-value bottleneck for encoder-only language models.
Inspired by the success of a discrete key-value bottleneck in vision, we address new and NLP-specific challenges.
 arXiv (2024-12-11T16:38:34Z)
- A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks [81.2624272756733]
 In dense retrieval, deep encoders provide embeddings for both inputs and targets.
We train a small parametric corrector network that adjusts stale cached target embeddings.
Our approach matches state-of-the-art results even when no target embedding updates are made during training.
 arXiv (2024-09-03T13:29:13Z)
- Complementary Learning Subnetworks for Parameter-Efficient Class-Incremental Learning [40.13416912075668]
 We propose a rehearsal-free CIL approach that learns continually via the synergy between two Complementary Learning Subnetworks.
Our method achieves competitive results against state-of-the-art methods, especially in accuracy gain, memory cost, training efficiency, and task-order robustness.
 arXiv (2023-06-21T01:43:25Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
 In practical scenarios where training data is limited, many predictive signals in the data may instead arise from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
 arXiv (2023-03-24T16:03:21Z)
- BatchFormer: Learning to Explore Sample Relationships for Robust Representation Learning [93.38239238988719]
 We propose to equip deep neural networks with the ability to learn sample relationships from each mini-batch.
BatchFormer is applied to the batch dimension of each mini-batch to implicitly explore sample relationships during training.
We perform extensive experiments on over ten datasets and the proposed method achieves significant improvements on different data scarcity applications.
 arXiv (2022-03-03T05:31:33Z)
- Lifelong Learning Without a Task Oracle [13.331659934508764]
 Supervised deep neural networks are known to undergo a sharp decline in the accuracy of older tasks when new tasks are learned.
We propose and compare several candidate task-assigning mappers which require very little memory overhead.
Best-performing variants only impose an average cost of 1.7% parameter memory increase.
 arXiv (2020-11-09T21:30:31Z)
- Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
 We focus on reducing annotation effort by learning to count in the crowd from a limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
 arXiv (2020-07-07T04:17:01Z)
- Laplacian Denoising Autoencoder [114.21219514831343]
 We propose to learn data representations with a novel type of denoising autoencoder.
The noisy input data is generated by corrupting latent clean data in the gradient domain.
Experiments on several visual benchmarks demonstrate that better representations can be learned with the proposed approach.
 arXiv (2020-03-30T16:52:39Z)
- Conditional Mutual information-based Contrastive Loss for Financial Time Series Forecasting [12.0855096102517]
 We present a representation learning framework for financial time series forecasting.
In this paper, we propose to first learn compact representations from time series data, then use the learned representations to train a simpler model for predicting time series movements.
 arXiv (2020-02-18T15:24:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     