Smoothing and Shrinking the Sparse Seq2Seq Search Space
- URL: http://arxiv.org/abs/2103.10291v1
- Date: Thu, 18 Mar 2021 14:45:38 GMT
- Title: Smoothing and Shrinking the Sparse Seq2Seq Search Space
- Authors: Ben Peters and André F. T. Martins
- Abstract summary: We show that entmax-based models effectively solve the "cat got your tongue" problem.
We also generalize label smoothing to the broader family of Fenchel-Young losses.
Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion.
- Score: 2.1828601975620257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current sequence-to-sequence models are trained to minimize cross-entropy and
use softmax to compute the locally normalized probabilities over target
sequences. While this setup has led to strong results in a variety of tasks,
one unsatisfying aspect is its length bias: models give high scores to short,
inadequate hypotheses and often make the empty string the argmax -- the
so-called "cat got your tongue" problem. Recently proposed entmax-based sparse
sequence-to-sequence models present a possible solution, since they can shrink
the search space by assigning zero probability to bad hypotheses, but their
ability to handle word-level tasks with transformers has never been tested. In
this work, we show that entmax-based models effectively solve the "cat got your
tongue" problem, removing a major source of model error for neural machine
translation. In addition, we generalize label smoothing, a critical
regularization technique, to the broader family of Fenchel-Young losses, which
includes both cross-entropy and the entmax losses. Our resulting label-smoothed
entmax loss models set a new state of the art on multilingual
grapheme-to-phoneme conversion and deliver improvements and better calibration
properties on cross-lingual morphological inflection and machine translation
for 6 language pairs.
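As context for the abstract, here is a minimal NumPy sketch (not the authors' released code) of the 1.5-entmax transform, computed by bisection, and of a Fenchel-Young loss evaluated against a label-smoothed target. The function names and the bisection routine are illustrative assumptions.

```python
import numpy as np

def entmax15(z, n_iter=50):
    """1.5-entmax via bisection: p_i = [(z_i / 2) - tau]_+^2, with tau
    chosen so the outputs sum to 1. A sketch, not the released `entmax`
    package implementation."""
    s = z / 2.0
    # tau lies in [max(s) - 1, max(s)]: mass >= 1 at the low end, 0 at the high end.
    lo, hi = s.max() - 1.0, s.max()
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        mass = np.sum(np.maximum(s - tau, 0.0) ** 2)
        if mass < 1.0:
            hi = tau
        else:
            lo = tau
    return np.maximum(s - (lo + hi) / 2.0, 0.0) ** 2

def tsallis_negentropy(p, alpha=1.5):
    """Omega_alpha(p) = (sum_i p_i^alpha - 1) / (alpha * (alpha - 1))."""
    return (np.sum(p ** alpha) - 1.0) / (alpha * (alpha - 1.0))

def label_smoothed_fy_loss(z, gold, n_classes, eps=0.1, alpha=1.5):
    """Fenchel-Young loss L(z; q) = <p* - q, z> - Omega(p*) + Omega(q),
    with p* = entmax(z) and q a label-smoothed target; at eps = 0 this
    reduces to the plain entmax loss."""
    q = np.full(n_classes, eps / n_classes)
    q[gold] += 1.0 - eps
    p = entmax15(z)
    return float((p - q) @ z
                 - tsallis_negentropy(p, alpha)
                 + tsallis_negentropy(q, alpha))

z = np.array([2.0, 1.0, -1.0, -3.0])
print(entmax15(z))                                  # low-scoring entries are exactly 0
print(label_smoothed_fy_loss(z, gold=0, n_classes=4))
```

Because entmax assigns exact zeros, whole hypotheses (including the empty string) can receive zero probability, which is how the search space shrinks.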
Related papers
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
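As an illustration of the iteratively-refined parallel decoding described in the entry above, here is a hedged mask-predict style sketch; the `predict_fn` interface, the MASK id, and the re-masking schedule are hypothetical stand-ins, not the paper's adapted T5.

```python
import numpy as np

MASK = 0  # hypothetical id for the [MASK] token

def iterative_parallel_decode(predict_fn, length, n_steps=4):
    """Mask-predict style loop: start fully masked, predict every masked
    position in parallel, then re-mask the least confident tokens and
    repeat with a shrinking mask budget."""
    tokens = np.full(length, MASK, dtype=int)
    conf = np.zeros(length)
    for step in range(n_steps):
        preds, pred_conf = predict_fn(tokens)
        masked = tokens == MASK
        tokens[masked] = preds[masked]      # fill all masked slots at once
        conf[masked] = pred_conf[masked]
        n_mask = int(length * (1.0 - (step + 1) / n_steps))
        if n_mask == 0:
            break
        worst = np.argsort(conf)[:n_mask]   # lowest-confidence positions
        tokens[worst] = MASK
    return tokens

rng = np.random.default_rng(0)
def dummy_predict(tokens):
    # Hypothetical stand-in for a masked LM returning (token ids, confidences).
    return rng.integers(1, 100, size=len(tokens)), rng.random(len(tokens))

print(iterative_parallel_decode(dummy_predict, length=8))
```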
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- Variational Classification [51.2541371924591]
We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders.
Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency.
We induce a chosen latent distribution, instead of the one implicitly assumed by a standard softmax layer.
arXiv Detail & Related papers (2023-05-17T17:47:19Z)
- r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax, outputting sparse probability distribution with controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z)
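The exact r-softmax formulation is given in the paper above; for reference, here is a NumPy sketch of sparsemax, the classic sparse softmax alternative such methods are compared against.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (Martins & Astudillo, 2016): Euclidean projection of the
    logits onto the probability simplex, which zeroes out low-scoring
    entries exactly. Sketch for 1-D inputs."""
    u = np.sort(z)[::-1]                    # logits in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(z) + 1)
    support = 1.0 + ks * u > css            # entries kept in the support
    k = ks[support][-1]                     # size of the support
    tau = (css[k - 1] - 1.0) / k            # threshold
    return np.maximum(z - tau, 0.0)

z = np.array([3.0, 1.0, 0.5, -2.0])
print(sparsemax(z))  # [1., 0., 0., 0.] -- all mass on the top logit
```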
- Jam or Cream First? Modeling Ambiguity in Neural Machine Translation with SCONES [10.785577504399077]
We propose to replace the softmax activation with a multi-label classification layer that can model ambiguity more effectively.
We show that the multi-label output layer can still be trained on single reference training data using the SCONES loss function.
We demonstrate that SCONES can be used to train NMT models that assign the highest probability to adequate translations.
arXiv Detail & Related papers (2022-05-02T07:51:37Z)
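To illustrate the multi-label output layer described in the entry above, here is a minimal sketch using independent per-token sigmoids with binary cross-entropy against the single reference token; the actual SCONES loss is defined in the paper, so treat this only as the general idea.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_token_loss(logits, gold):
    """Multi-label output layer sketch: each vocabulary item gets an
    independent sigmoid 'is this token adequate here?' score. With a
    single reference, the gold token is the one positive and everything
    else is negative -- plain binary cross-entropy, illustrating the
    multi-label idea rather than the exact SCONES objective."""
    p = sigmoid(logits)
    targets = np.zeros_like(logits)
    targets[gold] = 1.0
    eps = 1e-12
    return float(-np.sum(targets * np.log(p + eps)
                         + (1.0 - targets) * np.log(1.0 - p + eps)))

logits = np.array([4.0, 2.5, -1.0, -3.0])
# At inference, several tokens can score high simultaneously, so the
# model can represent ambiguity instead of splitting one softmax mass.
print(sigmoid(logits), multilabel_token_loss(logits, gold=0))
```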
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To harness the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z)
- Stochastic Projective Splitting: Solving Saddle-Point Problems with Multiple Regularizers [4.568911586155097]
We present a new variant of the projective splitting (PS) family of monotone algorithms for inclusion problems.
It can solve min-max and noncooperative game formulations arising in applications such as robust ML without the convergence issues associated with gradient descent-ascent.
arXiv Detail & Related papers (2021-06-24T14:48:43Z)
- Investigation of Large-Margin Softmax in Neural Language Modeling [43.51826343967195]
We investigate if introducing large-margins to neural language models would improve the perplexity and consequently word error rate in automatic speech recognition.
We find that although perplexity is slightly deteriorated, neural language models with large-margin softmax can yield word error rate similar to that of the standard softmax baseline.
arXiv Detail & Related papers (2020-05-20T14:53:19Z)
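As one concrete large-margin formulation (the paper above investigates several), here is a sketch of additive-margin softmax cross-entropy; the margin value and function name are illustrative, not the paper's exact setup.

```python
import numpy as np

def additive_margin_softmax_loss(logits, gold, margin=0.35):
    """Additive-margin softmax sketch: subtract a margin m from the gold
    class logit before the usual cross-entropy, forcing the gold score
    to beat the competitors by at least m."""
    z = logits.copy()
    z[gold] -= margin
    z -= z.max()                      # stabilize the log-sum-exp
    log_probs = z - np.log(np.sum(np.exp(z)))
    return float(-log_probs[gold])

logits = np.array([2.0, 1.8, -0.5])
print(additive_margin_softmax_loss(logits, gold=0))              # with margin
print(additive_margin_softmax_loss(logits, gold=0, margin=0.0))  # standard CE
```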
- Aligned Cross Entropy for Non-Autoregressive Machine Translation [120.15069387374717]
We propose aligned cross entropy (AXE) as an alternative loss function for training non-autoregressive models.
AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks.
arXiv Detail & Related papers (2020-04-03T16:24:47Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Large-scale language models that can generate long, coherent text are considered dangerous by some, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
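A toy sketch of the hypothesis-testing framing in the entry above: threshold a text's mean per-token log-likelihood under the scoring model. The statistic and threshold are illustrative assumptions; the paper analyzes the fundamental limits of any such test.

```python
import numpy as np

def detect_generated(token_logprobs, threshold=-3.0):
    """Hypothesis-testing sketch: H0 = text is genuine, H1 = text was
    generated by the model. Machine text tends to score a higher average
    log-likelihood under the generating LM than human text, so a simple
    test statistic is the mean per-token log-probability."""
    score = float(np.mean(token_logprobs))
    return score > threshold, score   # True -> classified as generated

# Hypothetical per-token log-probs from a scoring LM:
machine_like = np.array([-1.2, -0.8, -2.0, -1.5])
human_like = np.array([-4.5, -3.9, -6.1, -5.0])
print(detect_generated(machine_like))   # (True, -1.375)
print(detect_generated(human_like))     # (False, -4.875)
```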
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.