MLE-guided parameter search for task loss minimization in neural
sequence modeling
- URL: http://arxiv.org/abs/2006.03158v2
- Date: Mon, 5 Oct 2020 20:46:45 GMT
- Title: MLE-guided parameter search for task loss minimization in neural
sequence modeling
- Authors: Sean Welleck, Kyunghyun Cho
- Abstract summary: Neural autoregressive sequence models are used to generate sequences in a variety of natural language processing (NLP) tasks.
We propose maximum likelihood guided parameter search (MGS), which samples from a distribution over update directions that is a mixture of random search around the current parameters and around the maximum likelihood gradient.
Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation.
- Score: 83.83249536279239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural autoregressive sequence models are used to generate sequences in a
variety of natural language processing (NLP) tasks, where they are evaluated
according to sequence-level task losses. These models are typically trained
with maximum likelihood estimation, which ignores the task loss, yet
empirically performs well as a surrogate objective. Typical approaches to
directly optimizing the task loss such as policy gradient and minimum risk
training are based around sampling in the sequence space to obtain candidate
update directions that are scored based on the loss of a single sequence. In
this paper, we develop an alternative method based on random search in the
parameter space that leverages access to the maximum likelihood gradient. We
propose maximum likelihood guided parameter search (MGS), which samples from a
distribution over update directions that is a mixture of random search around
the current parameters and around the maximum likelihood gradient, with each
direction weighted by its improvement in the task loss. MGS shifts sampling to
the parameter space, and scores candidates using losses that are pooled from
multiple sequences. Our experiments show that MGS is capable of optimizing
sequence-level losses, with substantial reductions in repetition and
non-termination in sequence completion, and similar improvements to those of
minimum risk training in machine translation.
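The update rule described in the abstract can be sketched in a few lines: sample candidate update directions from a mixture of random perturbations around the current parameters and around the maximum likelihood gradient, then weight each direction by its improvement in the task loss. This is a minimal illustrative sketch in NumPy, not the authors' implementation; the function and parameter names (`mgs_step`, `sigma`, `lr`, the 50/50 mixture, the non-negative improvement weights) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mgs_step(theta, mle_grad, task_loss, n_samples=4, sigma=0.01, lr=0.1):
    """One illustrative MGS-style step. `task_loss` is a scalar function
    of the parameters (e.g. a sequence-level loss pooled over multiple
    sampled sequences); `mle_grad` is the maximum likelihood gradient
    evaluated at `theta`."""
    base_loss = task_loss(theta)
    directions, weights = [], []
    for _ in range(n_samples):
        # Mixture: with equal probability, perturb around the current
        # parameters or around the (scaled) MLE gradient direction.
        eps = sigma * rng.standard_normal(theta.shape)
        d = eps if rng.random() < 0.5 else -lr * mle_grad + eps
        # Weight each candidate by its improvement in the task loss
        # (clipped at zero so non-improving directions get no weight).
        improvement = base_loss - task_loss(theta + d)
        directions.append(d)
        weights.append(max(improvement, 0.0))
    total = sum(weights)
    if total == 0.0:
        return theta  # no sampled direction improved the loss
    update = sum(w * d for w, d in zip(weights, directions)) / total
    return theta + update
```

For a convex toy loss, the weighted combination of improving directions cannot increase the loss, which makes the behavior easy to check on a quadratic before applying the idea to sequence-level losses.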
Related papers
- Hessian Aware Low-Rank Perturbation for Order-Robust Continual Learning [19.850893012601638]
Continual learning aims to learn a series of tasks sequentially without forgetting the knowledge acquired from the previous ones.
We propose the Hessian Aware Low-Rank Perturbation algorithm for continual learning.
arXiv Detail & Related papers (2023-11-26T01:44:01Z) - SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z) - Unsupervised Learning of Initialization in Deep Neural Networks via
Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm to find good initialization for input data.
We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification.
We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z) - KaFiStO: A Kalman Filtering Framework for Stochastic Optimization [27.64040983559736]
We show that when training neural networks the loss function changes over (iteration) time due to the randomized selection of a subset of the samples.
This randomization turns the optimization problem into a stochastic one.
We propose to consider the loss as a noisy observation with respect to some reference.
arXiv Detail & Related papers (2021-07-07T16:13:57Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Transfer Bayesian Meta-learning via Weighted Free Energy Minimization [37.51664463278401]
A key assumption is that the auxiliary tasks, known as meta-training tasks, share the same generating distribution as the tasks to be encountered at deployment time.
This paper introduces weighted free energy minimization (WFEM) for transfer meta-learning.
arXiv Detail & Related papers (2021-06-20T15:17:51Z) - Optimal quantisation of probability measures using maximum mean
discrepancy [10.29438865750845]
Several researchers have proposed minimisation of maximum mean discrepancy (MMD) as a method to quantise probability measures.
We consider sequential algorithms that greedily minimise MMD over a discrete candidate set.
We investigate a variant that applies this technique to a mini-batch of the candidate set at each iteration.
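The greedy scheme this blurb describes can be illustrated compactly: at each step, add the candidate point that most reduces the squared MMD between the selected set and the target sample. The sketch below uses a Gaussian kernel and exhaustive candidate scoring; it conveys the general idea only, not the paper's algorithm, and all names (`greedy_mmd_quantise`, `gamma`) are illustrative assumptions.

```python
import numpy as np

def greedy_mmd_quantise(candidates, target, n_points, gamma=1.0):
    """Greedily select `n_points` from a discrete candidate set so that
    the empirical measure of the selection has small squared MMD to the
    empirical measure of `target`, under a Gaussian kernel."""
    def k(a, b):
        # Pairwise Gaussian kernel matrix between rows of a and b.
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * d2)

    kt = k(target, target).mean()  # constant term, computed once
    selected = []
    for _ in range(n_points):
        best, best_mmd = None, np.inf
        for i, c in enumerate(candidates):
            s = np.array(selected + [c])
            # Squared MMD between the empirical measures of s and target.
            mmd2 = k(s, s).mean() - 2 * k(s, target).mean() + kt
            if mmd2 < best_mmd:
                best, best_mmd = i, mmd2
        selected.append(candidates[best])
    return np.array(selected)
```

The mini-batch variant mentioned above would simply score a random subset of `candidates` at each iteration instead of the full set.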
arXiv Detail & Related papers (2020-10-14T13:09:48Z) - Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered Weighted $L_1$ (OWL) regularized regression is a new regression approach for high-dimensional sparse learning.
Proximal gradient methods are used as standard approaches to solve OWL regression.
We propose the first safe screening rule for OWL regression by exploring the order of the primal solution with the unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z) - Resolving learning rates adaptively by locating Stochastic Non-Negative
Associated Gradient Projection Points using line searches [0.0]
Learning rates in neural network training are currently determined prior to training, using expensive manual or automated tuning.
This study proposes gradient-only line searches to resolve the learning rate for neural network training algorithms.
arXiv Detail & Related papers (2020-01-15T03:08:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.