MLE-guided parameter search for task loss minimization in neural
sequence modeling
- URL: http://arxiv.org/abs/2006.03158v2
- Date: Mon, 5 Oct 2020 20:46:45 GMT
- Title: MLE-guided parameter search for task loss minimization in neural
sequence modeling
- Authors: Sean Welleck, Kyunghyun Cho
- Abstract summary: Neural autoregressive sequence models are used to generate sequences in a variety of natural language processing (NLP) tasks.
We propose maximum likelihood guided parameter search (MGS), which samples from a distribution over update directions that is a mixture of random search around the current parameters and around the maximum likelihood gradient.
Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation.
- Score: 83.83249536279239
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural autoregressive sequence models are used to generate sequences in a
variety of natural language processing (NLP) tasks, where they are evaluated
according to sequence-level task losses. These models are typically trained
with maximum likelihood estimation, which ignores the task loss, yet
empirically performs well as a surrogate objective. Typical approaches to
directly optimizing the task loss such as policy gradient and minimum risk
training are based around sampling in the sequence space to obtain candidate
update directions that are scored based on the loss of a single sequence. In
this paper, we develop an alternative method based on random search in the
parameter space that leverages access to the maximum likelihood gradient. We
propose maximum likelihood guided parameter search (MGS), which samples from a
distribution over update directions that is a mixture of random search around
the current parameters and around the maximum likelihood gradient, with each
direction weighted by its improvement in the task loss. MGS shifts sampling to
the parameter space, and scores candidates using losses that are pooled from
multiple sequences. Our experiments show that MGS is capable of optimizing
sequence-level losses, with substantial reductions in repetition and
non-termination in sequence completion, and similar improvements to those of
minimum risk training in machine translation.
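The update rule described in the abstract can be sketched in a few lines: sample candidate update directions from a mixture of random perturbations around the current parameters and around the maximum likelihood gradient, then weight each direction by its improvement in the task loss. This is a minimal illustrative sketch in NumPy, not the authors' implementation; the function and parameter names (`mgs_step`, `sigma`, `lr`, the 50/50 mixture, the non-negative improvement weights) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mgs_step(theta, mle_grad, task_loss, n_samples=4, sigma=0.01, lr=0.1):
    """One illustrative MGS-style step. `task_loss` is a scalar function
    of the parameters (e.g. a sequence-level loss pooled over multiple
    sampled sequences); `mle_grad` is the maximum likelihood gradient
    evaluated at `theta`."""
    base_loss = task_loss(theta)
    directions, weights = [], []
    for _ in range(n_samples):
        # Mixture: with equal probability, perturb around the current
        # parameters or around the (scaled) MLE gradient direction.
        eps = sigma * rng.standard_normal(theta.shape)
        d = eps if rng.random() < 0.5 else -lr * mle_grad + eps
        # Weight each candidate by its improvement in the task loss
        # (clipped at zero so non-improving directions get no weight).
        improvement = base_loss - task_loss(theta + d)
        directions.append(d)
        weights.append(max(improvement, 0.0))
    total = sum(weights)
    if total == 0.0:
        return theta  # no sampled direction improved the loss
    update = sum(w * d for w, d in zip(weights, directions)) / total
    return theta + update
```

For a convex toy loss, the weighted combination of improving directions cannot increase the loss, which makes the behavior easy to check on a quadratic before applying the idea to sequence-level losses.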
Related papers
- Hessian Aware Low-Rank Perturbation for Order-Robust Continual Learning [19.850893012601638]
Continual learning aims to learn a series of tasks sequentially without forgetting the knowledge acquired from the previous ones.
We propose the Hessian Aware Low-Rank Perturbation algorithm for continual learning.
arXiv Detail & Related papers (2023-11-26T01:44:01Z) - SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z) - Unsupervised Learning of Initialization in Deep Neural Networks via
Maximum Mean Discrepancy [74.34895342081407]
We propose an unsupervised algorithm to find good initialization for input data.
We first notice that each parameter configuration in the parameter space corresponds to one particular downstream task of d-way classification.
We then conjecture that the success of learning is directly related to how diverse downstream tasks are in the vicinity of the initial parameters.
arXiv Detail & Related papers (2023-02-08T23:23:28Z) - KaFiStO: A Kalman Filtering Framework for Stochastic Optimization [27.64040983559736]
We show that when training neural networks the loss function changes over (iteration) time due to the randomized selection of a subset of the samples.
This randomization turns the optimization problem into a stochastic one.
We propose to consider the loss as a noisy observation with respect to some reference.
arXiv Detail & Related papers (2021-07-07T16:13:57Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Transfer Bayesian Meta-learning via Weighted Free Energy Minimization [37.51664463278401]
A key assumption is that the auxiliary tasks, known as meta-training tasks, share the same generating distribution as the tasks to be encountered at deployment time.
This paper introduces weighted free energy minimization (WFEM) for transfer meta-learning.
arXiv Detail & Related papers (2021-06-20T15:17:51Z) - Optimal quantisation of probability measures using maximum mean
discrepancy [10.29438865750845]
Several researchers have proposed minimisation of maximum mean discrepancy (MMD) as a method to quantise probability measures.
We consider sequential algorithms that greedily minimise MMD over a discrete candidate set.
We investigate a variant that applies this technique to a mini-batch of the candidate set at each iteration.
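The greedy scheme this blurb describes can be illustrated compactly: at each step, add the candidate point that most reduces the squared MMD between the selected set and the target sample. The sketch below uses a Gaussian kernel and exhaustive candidate scoring; it conveys the general idea only, not the paper's algorithm, and all names (`greedy_mmd_quantise`, `gamma`) are illustrative assumptions.

```python
import numpy as np

def greedy_mmd_quantise(candidates, target, n_points, gamma=1.0):
    """Greedily select `n_points` from a discrete candidate set so that
    the empirical measure of the selection has small squared MMD to the
    empirical measure of `target`, under a Gaussian kernel."""
    def k(a, b):
        # Pairwise Gaussian kernel matrix between rows of a and b.
        d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-gamma * d2)

    kt = k(target, target).mean()  # constant term, computed once
    selected = []
    for _ in range(n_points):
        best, best_mmd = None, np.inf
        for i, c in enumerate(candidates):
            s = np.array(selected + [c])
            # Squared MMD between the empirical measures of s and target.
            mmd2 = k(s, s).mean() - 2 * k(s, target).mean() + kt
            if mmd2 < best_mmd:
                best, best_mmd = i, mmd2
        selected.append(candidates[best])
    return np.array(selected)
```

The mini-batch variant mentioned above would simply score a random subset of `candidates` at each iteration instead of the full set.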
arXiv Detail & Related papers (2020-10-14T13:09:48Z) - Fast OSCAR and OWL Regression via Safe Screening Rules [97.28167655721766]
Ordered Weighted $L_1$ (OWL) regularized regression is a new regression approach for high-dimensional sparse learning.
Proximal gradient methods are used as standard approaches to solve OWL regression.
We propose the first safe screening rule for OWL regression by exploring the order of the primal solution with the unknown order structure.
arXiv Detail & Related papers (2020-06-29T23:35:53Z) - Pre-training Is (Almost) All You Need: An Application to Commonsense
Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks.
We introduce a new scoring method that casts a plausibility ranking task in a full-text format.
We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z) - Resolving learning rates adaptively by locating Stochastic Non-Negative
Associated Gradient Projection Points using line searches [0.0]
Learning rates in neural network training are currently determined prior to training, using expensive manual or automated tuning.
This study proposes gradient-only line searches to resolve the learning rate for neural network training algorithms.
arXiv Detail & Related papers (2020-01-15T03:08:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.