Surprisal-Triggered Conditional Computation with Neural Networks
- URL: http://arxiv.org/abs/2006.01659v1
- Date: Tue, 2 Jun 2020 14:34:24 GMT
- Title: Surprisal-Triggered Conditional Computation with Neural Networks
- Authors: Loren Lugosch, Derek Nowrouzezahrai, Brett H. Meyer
- Abstract summary: Autoregressive neural network models have been used successfully for sequence generation, feature extraction, and hypothesis scoring.
This paper presents yet another use for these models: allocating more computation to more difficult inputs.
- Score: 19.55737970532817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive neural network models have been used successfully for sequence
generation, feature extraction, and hypothesis scoring. This paper presents yet
another use for these models: allocating more computation to more difficult
inputs. In our approach, an autoregressive model is used both to extract
features and to predict each observation in a stream of inputs. The surprisal of
the input, measured as the negative log-likelihood of the current observation
according to the autoregressive model, is used as a measure of input
difficulty. This in turn determines whether a small, fast network or a big,
slow network is used. Experiments on two speech recognition tasks show that
our model can match the performance of a baseline that always uses the big
network while using 15% fewer FLOPs.
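As a concrete illustration of the routing rule described above, the sketch below computes surprisal as the negative log-likelihood of the current observation under the autoregressive model and dispatches to the small or big network accordingly. The module interfaces, the discrete observation space, and the threshold value are assumptions made for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SurprisalRouter(nn.Module):
    """Minimal sketch of surprisal-triggered routing (illustrative, not the
    authors' code). Assumes discrete observations and an AR model that
    returns (logits over the next observation, features for downstream use)."""

    def __init__(self, ar_model: nn.Module, small_net: nn.Module,
                 big_net: nn.Module, threshold: float = 2.0):
        super().__init__()
        self.ar_model = ar_model    # feature extractor + next-step predictor
        self.small_net = small_net  # fast network for low-surprisal inputs
        self.big_net = big_net      # slow network for high-surprisal inputs
        self.threshold = threshold  # surprisal cutoff, tuned on held-out data

    def forward(self, history: torch.Tensor, x_t: torch.Tensor) -> torch.Tensor:
        logits, features = self.ar_model(history)
        # Surprisal: negative log-likelihood of the current observation
        # under the autoregressive model's prediction.
        surprisal = F.cross_entropy(logits, x_t)
        # Surprising (difficult) inputs are routed to the big, slow network.
        if surprisal.item() > self.threshold:
            return self.big_net(features)
        return self.small_net(features)
```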
Related papers
- Power Failure Cascade Prediction using Graph Neural Networks [4.667031410586657]
We propose a flow-free model that predicts grid states at every generation of a cascade process given an initial contingency and power injection values.
We show that the proposed model reduces the computational time by almost two orders of magnitude.
arXiv Detail & Related papers (2024-04-24T18:45:50Z)
- A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z)
- Instance-wise Linearization of Neural Network for Model Interpretation [13.583425552511704]
The challenge lies in the non-linear behavior of the neural network.
For a neural network model, this non-linear behavior is often caused by the model's non-linear activation units.
We propose an instance-wise linearization approach that reformulates the forward computation process of a neural network prediction (a minimal sketch of this idea for ReLU networks follows this list).
arXiv Detail & Related papers (2023-10-25T02:07:39Z)
- Deep Networks as Denoising Algorithms: Sample-Efficient Learning of Diffusion Models in High-Dimensional Graphical Models [22.353510613540564]
We investigate the approximation efficiency of score functions by deep neural networks in generative modeling.
We observe that score functions can often be well approximated in graphical models by variational inference denoising algorithms.
We provide an efficient sample complexity bound for diffusion-based generative modeling when the score function is learned by deep neural networks.
arXiv Detail & Related papers (2023-09-20T15:51:10Z)
- How neural networks learn to classify chaotic time series [77.34726150561087]
We study the inner workings of neural networks trained to classify regular-versus-chaotic time series.
We find that the relation between input periodicity and activation periodicity is key for the performance of LKCNN models.
arXiv Detail & Related papers (2023-06-04T08:53:27Z)
- Learning to Jump: Thinning and Thickening Latent Counts for Generative Modeling [69.60713300418467]
Learning to jump is a general recipe for generative modeling of various types of data.
We demonstrate when learning to jump is expected to perform comparably to learning to denoise, and when it is expected to perform better.
arXiv Detail & Related papers (2023-05-28T05:38:28Z)
- Continuous time recurrent neural networks: overview and application to forecasting blood glucose in the intensive care unit [56.801856519460465]
Continuous time autoregressive recurrent neural networks (CTRNNs) are deep learning models that account for irregularly timed observations.
We demonstrate the application of these models to probabilistic forecasting of blood glucose in a critical care setting.
arXiv Detail & Related papers (2023-04-14T09:39:06Z)
- Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks [62.48782506095565]
We show that due to the greedy nature of learning in deep neural networks, models tend to rely on just one modality while under-fitting the other modalities.
We propose an algorithm to balance the conditional learning speeds between modalities during training and demonstrate that it indeed addresses the issue of greedy learning.
arXiv Detail & Related papers (2022-02-10T20:11:21Z)
- Improving Video Instance Segmentation by Light-weight Temporal Uncertainty Estimates [11.580916951856256]
We present a time-dynamic approach to model uncertainties of instance segmentation networks.
We apply this approach to the detection of false positives and the estimation of prediction quality.
The proposed method only requires a readily trained neural network and video sequence input.
arXiv Detail & Related papers (2020-12-14T13:39:05Z)
- A Bayesian Perspective on Training Speed and Model Selection [51.15664724311443]
We show that a measure of a model's training speed can be used to estimate its marginal likelihood.
We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks.
Our results suggest a promising new direction towards explaining why neural networks trained with gradient descent are biased towards functions that generalize well.
arXiv Detail & Related papers (2020-10-27T17:56:14Z)
- Dynamic Time Warping as a New Evaluation for Dst Forecast with Machine Learning [0.0]
We train a neural network to forecast the disturbance storm time (Dst) index at origin time $t$ with a forecasting horizon of 1 up to 6 hours.
Inspection of the model's results with the correlation coefficient and RMSE indicated a performance comparable to the latest publications.
A new method based on dynamic time warping is proposed to measure whether two time series are shifted in time with respect to each other (a minimal DTW sketch follows this list).
arXiv Detail & Related papers (2020-06-08T15:14:13Z)
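As referenced in the instance-wise linearization entry above, a ReLU network's prediction is exactly linear in the input once a given instance's activation pattern is fixed. The sketch below recovers that per-instance linear map for a plain MLP; it illustrates the general idea only, under the assumption of ReLU hidden layers and a linear output layer, and is not the paper's specific algorithm.

```python
import numpy as np

def instance_linearization(weights, biases, x):
    """Recover W_eff, b_eff with f(x) = W_eff @ x + b_eff for a ReLU MLP,
    valid for the activation pattern induced by this particular input x."""
    W_eff = np.eye(x.shape[0])
    b_eff = np.zeros(x.shape[0])
    h = x
    for k, (W, b) in enumerate(zip(weights, biases)):
        pre = W @ h + b
        if k < len(weights) - 1:
            # Hidden layer: fold the instance's ReLU mask into the linear map.
            mask = (pre > 0).astype(pre.dtype)
            W_eff = (W * mask[:, None]) @ W_eff   # diag(mask) @ W @ W_eff
            b_eff = mask * (W @ b_eff + b)
            h = pre * mask
        else:
            # Linear output layer: compose without a mask.
            W_eff = W @ W_eff
            b_eff = W @ b_eff + b
            h = pre
    return W_eff, b_eff, h

# Sanity check: the linearized map reproduces the network's output exactly.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(8, 4)), rng.normal(size=(3, 8))]
biases = [rng.normal(size=8), rng.normal(size=3)]
x = rng.normal(size=4)
W_eff, b_eff, y = instance_linearization(weights, biases, x)
assert np.allclose(y, W_eff @ x + b_eff)
```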
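Likewise, for the dynamic time warping entry, here is a textbook DTW distance between two 1-D series. The absolute-difference cost and the absence of a warping-window constraint are generic choices, not the paper's specific evaluation setup.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(n*m) dynamic time warping distance between 1-D series."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Cheapest of the three admissible moves: match, insert, delete.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return float(D[n, m])

# A forecast that lags the target by a fixed offset keeps a small DTW
# distance even though the lag inflates its pointwise RMSE.
t = np.linspace(0.0, 6.0, 60)
target = np.sin(t)
lagged_forecast = np.sin(t - 0.1)
print(dtw_distance(target, lagged_forecast))
```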