Controlling Computation versus Quality for Neural Sequence Models
- URL: http://arxiv.org/abs/2002.07106v2
- Date: Thu, 16 Apr 2020 15:01:45 GMT
- Title: Controlling Computation versus Quality for Neural Sequence Models
- Authors: Ankur Bapna, Naveen Arivazhagan, Orhan Firat
- Abstract summary: Conditional computation makes neural sequence models (Transformers) more efficient and computation-aware during inference.
We evaluate our approach on two tasks: (i) WMT English-French Translation and (ii) Unsupervised representation learning (BERT).
- Score: 42.525463454120256
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most neural networks utilize the same amount of compute for every example
independent of the inherent complexity of the input. Further, methods that
adapt the amount of computation to the example focus on finding a fixed
inference-time computational graph per example, ignoring any external
computational budgets or varying inference time limitations. In this work, we
utilize conditional computation to make neural sequence models (Transformer)
more efficient and computation-aware during inference. We first modify the
Transformer architecture, making each set of operations conditionally
executable depending on the output of a learned control network. We then train
this model in a multi-task setting, where each task corresponds to a particular
computation budget. This allows us to train a single model that can be
controlled to operate on different points of the computation-quality trade-off
curve, depending on the available computation budget at inference time. We
evaluate our approach on two tasks: (i) WMT English-French Translation and (ii)
Unsupervised representation learning (BERT). Our experiments demonstrate that
the proposed Conditional Computation Transformer (CCT) is competitive with
vanilla Transformers when allowed to utilize its full computational budget,
while improving significantly over computationally equivalent baselines when
operating on smaller computational budgets.
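The abstract describes making each set of Transformer operations conditionally executable based on the output of a learned control network, with training performed across several computation budgets. Below is a minimal PyTorch sketch of what such budget-conditioned gating could look like; the class names (ControlGate, BudgetedLayer), the mean-pooling of the layer input, and the way the scalar budget is fed to the gate are illustrative assumptions rather than the paper's actual CCT implementation, and the budget-dependent regularization used for multi-task training is omitted.

```python
# Illustrative sketch (not the authors' code): a small control network emits an
# execution gate per block, and blocks whose gate is near zero can be skipped at
# inference to save computation. ControlGate / BudgetedLayer are hypothetical names.
import torch
import torch.nn as nn


class ControlGate(nn.Module):
    """Predicts an execution gate in [0, 1] from the layer input and a scalar budget."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model + 1, 1)

    def forward(self, x: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); budget: (batch,) with values in [0, 1]
        pooled = x.mean(dim=1)                           # (batch, d_model)
        b = budget.unsqueeze(-1)                         # (batch, 1)
        return torch.sigmoid(self.proj(torch.cat([pooled, b], dim=-1)))  # (batch, 1)


class BudgetedLayer(nn.Module):
    """A Transformer encoder layer whose block is gated by the control network."""

    def __init__(self, d_model: int = 256, nhead: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.gate = ControlGate(d_model)

    def forward(self, x: torch.Tensor, budget: torch.Tensor) -> torch.Tensor:
        g = self.gate(x, budget).unsqueeze(-1)           # (batch, 1, 1)
        # Soft gating during training; at inference the block could be skipped
        # entirely whenever g falls below a threshold.
        return x + g * (self.block(x) - x)


if __name__ == "__main__":
    layer = BudgetedLayer()
    x = torch.randn(2, 10, 256)
    for budget in (0.25, 1.0):                           # "multi-task" over budgets
        b = torch.full((2,), budget)
        print(budget, layer(x, b).shape)
```

At inference time the same trained layer can then be queried with different budget values to move along the computation-quality trade-off curve described in the abstract.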
Related papers
- Predicting Probabilities of Error to Combine Quantization and Early Exiting: QuEE [68.6018458996143]
We propose QuEE, a more general dynamic network that can combine both quantization and early exiting.
Our algorithm can be seen as a form of soft early exiting or input-dependent compression.
The crucial factor of our approach is accurate prediction of the potential accuracy improvement achievable through further computation; see the early-exit sketch after this list.
arXiv Detail & Related papers (2024-06-20T15:25:13Z) - Discrete Neural Algorithmic Reasoning [18.497863598167257]
We propose to force neural reasoners to maintain the execution trajectory as a combination of finite predefined states.
Trained with supervision on the algorithm's state transitions, such models are able to perfectly align with the original algorithm.
arXiv Detail & Related papers (2024-02-18T16:03:04Z) - Efficient Controllable Multi-Task Architectures [85.76598445904374]
We propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable.
Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost.
This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures.
arXiv Detail & Related papers (2023-08-22T19:09:56Z) - Fast Training of NMT Model with Data Sorting [0.0]
The Transformer model has revolutionized Natural Language Processing tasks such as Neural Machine Translation.
One potential area for improvement is the computation spent on empty (padding) tokens, which the Transformer processes only to discard later.
We propose an algorithm that sorts sentence pairs based on their length before translation, minimizing the waste of computing power; see the length-sorting sketch after this list.
arXiv Detail & Related papers (2023-08-16T05:48:50Z) - Transformers as Statisticians: Provable In-Context Learning with
In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - RWKV: Reinventing RNNs for the Transformer Era [54.716108899349614]
We propose a novel model architecture that combines the efficient parallelizable training of transformers with the efficient inference of RNNs.
We scale our models as large as 14 billion parameters, by far the largest dense RNN ever trained, and find RWKV performs on par with similarly sized Transformers.
arXiv Detail & Related papers (2023-05-22T13:57:41Z) - CC-FedAvg: Computationally Customized Federated Averaging [11.687451505965655]
Federated learning (FL) is an emerging paradigm to train models with distributed data from numerous Internet of Things (IoT) devices.
We propose a strategy for estimating local models without computationally intensive iterations.
We show that CC-FedAvg has the same convergence rate and comparable performance as FedAvg without resource constraints.
arXiv Detail & Related papers (2022-12-28T03:32:29Z) - Berrut Approximated Coded Computing: Straggler Resistance Beyond
Polynomial Computing [34.69732430310801]
We propose Berrut Approximated Coded Computing (BACC) as an alternative approach to deal with stragglers effect.
BACC is proven to be numerically stable with low computational complexity.
In particular, BACC is used to train a deep neural network on a cluster of servers.
arXiv Detail & Related papers (2020-09-17T14:23:38Z) - Straggler-aware Distributed Learning: Communication Computation Latency
Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers as either stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)