Alleviate Exposure Bias in Sequence Prediction with Recurrent Neural Networks
- URL: http://arxiv.org/abs/2103.11603v1
- Date: Mon, 22 Mar 2021 06:15:22 GMT
- Title: Alleviate Exposure Bias in Sequence Prediction with Recurrent Neural Networks
- Authors: Liping Yuan, Jiangtao Feng, Xiaoqing Zheng, Xuanjing Huang
- Abstract summary: A popular strategy to train recurrent neural networks (RNNs) is to take the ground truth as input at each time step.
We propose a fully differentiable training algorithm for RNNs to better capture long-term dependencies.
- Score: 47.52214243454995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A popular strategy to train recurrent neural networks (RNNs), known as
``teacher forcing'', takes the ground truth as input at each time step and makes
the later predictions partly conditioned on those inputs. Such a training
strategy impairs their ability to learn rich distributions over entire
sequences because the chosen inputs hinder the gradients from back-propagating to
all previous states in an end-to-end manner. We propose a fully differentiable
training algorithm for RNNs to better capture long-term dependencies by
recovering the probability of the whole sequence. The key idea is that at each
time step, the network takes as input a ``bundle'' of similar words predicted
at the previous step instead of a single ground truth. The representations of
these similar words form a convex hull, which can be taken as a kind of
regularization on the input. Smoothing the inputs in this way makes the whole
process trainable and differentiable. This design makes it possible for the
model to explore more feasible combinations (possibly unseen sequences), and
can be interpreted as a computationally efficient approximation to beam
search. Experiments on multiple sequence generation tasks yield performance
improvements, especially in sequence-level metrics such as BLEU and ROUGE-2.
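
A minimal sketch of the idea described in the abstract (not the authors' released code): at each step, the top-k words from the previous prediction are embedded and averaged with their renormalized probabilities, so the input stays inside the convex hull of the bundle and gradients can flow through every step. The layer sizes, the GRU cell, and the value of k are illustrative assumptions.

# Hypothetical sketch of "bundle" input smoothing for an RNN decoder.
# Instead of feeding one ground-truth token per step, we feed a convex
# combination of the embeddings of the k most probable previous words.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BundleRNNDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, k=5):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRUCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, bos_ids, hidden, max_len=20):
        # bos_ids: (batch,) start-of-sequence token ids
        # hidden:  (batch, hidden_dim) initial decoder state
        inp = self.embed(bos_ids)                     # (batch, embed_dim)
        logits_per_step = []
        for _ in range(max_len):
            hidden = self.rnn(inp, hidden)            # (batch, hidden_dim)
            logits = self.out(hidden)                 # (batch, vocab)
            logits_per_step.append(logits)
            # Keep the k most probable words and renormalize their scores;
            # the weighted average of their embeddings lies in the convex
            # hull of the bundle and is differentiable w.r.t. the logits.
            topk_logits, topk_ids = logits.topk(self.k, dim=-1)
            weights = F.softmax(topk_logits, dim=-1)  # (batch, k)
            bundle = self.embed(topk_ids)             # (batch, k, embed_dim)
            inp = (weights.unsqueeze(-1) * bundle).sum(dim=1)
        return torch.stack(logits_per_step, dim=1)    # (batch, max_len, vocab)

During training, the returned logits can be scored with an ordinary cross-entropy loss against the target sequence; because each input is a differentiable mixture rather than a sampled token, the loss gradients reach all earlier time steps, which is what allows the method to be read as a cheap approximation to beam search.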
Related papers
- Distributive Pre-Training of Generative Modeling Using Matrix-Product
States [0.0]
We consider an alternative training scheme utilizing basic tensor network operations, e.g., summation and compression.
The training algorithm is based on compressing the superposition state constructed from all the training data in product state representation.
We benchmark the algorithm on the MNIST dataset and show reasonable results on image generation and classification tasks.
arXiv Detail & Related papers (2023-06-26T15:46:08Z) - Return of the RNN: Residual Recurrent Networks for Invertible Sentence
Embeddings [0.0]
This study presents a novel model for invertible sentence embeddings using a residual recurrent network trained on an unsupervised encoding task.
Rather than the probabilistic outputs common to neural machine translation models, our approach employs a regression-based output layer to reconstruct the input sequence's word vectors.
The model achieves high accuracy and fast training with the Adam optimizer, a significant finding given that RNNs typically require memory units, such as LSTMs, or second-order optimization methods.
arXiv Detail & Related papers (2023-03-23T15:59:06Z) - Theoretical Characterization of How Neural Network Pruning Affects its
Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z) - Boosted Dynamic Neural Networks [53.559833501288146]
A typical early-exiting dynamic neural network (EDNN) has multiple prediction heads at different layers of the network backbone.
To optimize the model, these prediction heads together with the network backbone are trained on every batch of training data.
Treating training and testing inputs differently in the two phases causes a mismatch between the training and testing data distributions.
We formulate an EDNN as an additive model inspired by gradient boosting, and propose multiple training techniques to optimize the model effectively.
arXiv Detail & Related papers (2022-11-30T04:23:12Z) - Towards Better Out-of-Distribution Generalization of Neural Algorithmic
Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z) - Pretraining Graph Neural Networks for few-shot Analog Circuit Modeling
and Design [68.1682448368636]
We present a supervised pretraining approach to learn circuit representations that can be adapted to new unseen topologies or unseen prediction tasks.
To cope with the variable topological structure of different circuits, we describe each circuit as a graph and use graph neural networks (GNNs) to learn node embeddings.
We show that pretraining GNNs on prediction of output node voltages can encourage learning representations that can be adapted to new unseen topologies or prediction of new circuit level properties.
arXiv Detail & Related papers (2022-03-29T21:18:47Z) - Dual Lottery Ticket Hypothesis [71.95937879869334]
Lottery Ticket Hypothesis (LTH) provides a novel view to investigate sparse network training and maintain its capacity.
In this work, we regard the winning ticket from LTH as a subnetwork that is in a trainable condition, and take its performance as our benchmark.
We propose a simple sparse network training strategy, Random Sparse Network Transformation (RST), to substantiate our Dual Lottery Ticket Hypothesis (DLTH).
arXiv Detail & Related papers (2022-03-08T18:06:26Z) - SparseGAN: Sparse Generative Adversarial Network for Text Generation [8.634962333084724]
We propose SparseGAN, which generates semantically interpretable but sparse sentence representations as inputs to the discriminator.
With such semantic-rich representations, we not only reduce unnecessary noise for efficient adversarial training, but also make the entire training process fully differentiable.
arXiv Detail & Related papers (2021-03-22T04:44:43Z) - Nested Learning For Multi-Granular Tasks [24.600419295290504]
Standard deep neural networks (DNNs) are commonly trained in an end-to-end fashion for specific tasks and generalize poorly to samples that are not from the original training distribution.
We introduce the concept of nested learning: how to obtain a hierarchical representation of the input.
We show that nested learning outperforms the same network trained in the standard end-to-end fashion.
arXiv Detail & Related papers (2020-07-13T14:27:14Z) - Neural Execution Engines: Learning to Execute Subroutines [29.036699193820215]
We study the generalization issues at the level of numerical subroutines that comprise common algorithms like sorting, shortest paths, and minimum spanning trees.
To generalize to unseen data, we show that encoding numbers with a binary representation leads to embeddings with rich structure once trained on downstream tasks like addition or multiplication (a rough sketch of such an encoding follows this list).
arXiv Detail & Related papers (2020-06-15T01:51:37Z)
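
The binary number encoding mentioned in the Neural Execution Engines entry above can be sketched as follows; the fixed width and least-significant-bit-first layout are illustrative assumptions, not that paper's code.

# Hypothetical illustration of a binary number encoding: an integer is mapped
# to a fixed-width vector of its bits, which a downstream model can embed.
import torch

def binary_encode(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Map an integer tensor of shape (batch,) to bit vectors (batch, num_bits)."""
    bits = torch.arange(num_bits, device=x.device)
    # Shift each value right by 0..num_bits-1 and keep the lowest bit.
    return ((x.unsqueeze(-1) >> bits) & 1).float()

# Example: binary_encode(torch.tensor([3, 10])) ->
# [[1,1,0,0,0,0,0,0], [0,1,0,1,0,0,0,0]]  (least-significant bit first)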
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.