On the Discrepancy between Density Estimation and Sequence Generation
- URL: http://arxiv.org/abs/2002.07233v1
- Date: Mon, 17 Feb 2020 20:13:35 GMT
- Title: On the Discrepancy between Density Estimation and Sequence Generation
- Authors: Jason Lee, Dustin Tran, Orhan Firat, Kyunghyun Cho
- Abstract summary: log-likelihood is highly correlated with BLEU when we consider models within the same family.
We observe no correlation between rankings of models across different families.
- Score: 92.70116082182076
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many sequence-to-sequence generation tasks, including machine translation and
text-to-speech, can be posed as estimating the density of the output y given
the input x: p(y|x). Given this interpretation, it is natural to evaluate
sequence-to-sequence models using conditional log-likelihood on a test set.
However, the goal of sequence-to-sequence generation (or structured prediction)
is to find the best output y^ given an input x, and each task has its own
downstream metric R that scores a model output by comparing against a set of
references y*: R(y^, y* | x). While we hope that a model that excels in density
estimation also performs well on the downstream metric, the exact correlation
has not been studied for sequence generation tasks. In this paper, by comparing
several density estimators on five machine translation tasks, we find that the
correlation between rankings of models based on log-likelihood and BLEU varies
significantly depending on the range of the model families being compared.
First, log-likelihood is highly correlated with BLEU when we consider models
within the same family (e.g. autoregressive models, or latent variable models
with the same parameterization of the prior). However, we observe no
correlation between rankings of models across different families: (1) among
non-autoregressive latent variable models, a flexible prior distribution is
better at density estimation but gives worse generation quality than a simple
prior, and (2) autoregressive models offer the best translation performance
overall, while latent variable models with a normalizing flow prior give the
highest held-out log-likelihood across all datasets. Therefore, we recommend
using a simple prior for the latent variable non-autoregressive model when fast
generation speed is desired.
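To make the comparison concrete, here is a small sketch (using SciPy, with made-up model names and scores rather than numbers from the paper) of how one can quantify agreement between a ranking by held-out log-likelihood and a ranking by BLEU using rank-correlation statistics.

```python
# Illustrative only: compare model rankings by held-out log-likelihood vs. BLEU.
# Model names and scores below are hypothetical, not taken from the paper.
from scipy.stats import kendalltau, spearmanr

models = ["AR-base", "AR-big", "LV-simple-prior", "LV-flow-prior"]
log_likelihood = [-45.2, -43.8, -47.1, -42.9]   # higher (less negative) is better
bleu = [27.4, 28.9, 25.1, 24.3]                 # higher is better

tau, tau_p = kendalltau(log_likelihood, bleu)
rho, rho_p = spearmanr(log_likelihood, bleu)
print(f"Kendall tau = {tau:.2f} (p={tau_p:.2f}), Spearman rho = {rho:.2f} (p={rho_p:.2f})")
# A tau near +1 means the two metrics rank models the same way;
# a tau near 0 means the rankings do not agree.
```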
Related papers
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood estimation (MLE) objective does not match the downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and the distribution of sequences in a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
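As a loose illustration of comparing a model's sequence distribution against the data distribution without adversarial training, the sketch below measures a crude KL divergence between empirical bigram distributions of sampled and reference sequences; this is a toy stand-in, not the SequenceMatch objective.

```python
# Illustrative only: a crude divergence between empirical bigram distributions of
# model samples and data sequences. This is NOT the SequenceMatch objective; it only
# illustrates comparing sequence distributions without adversarial training.
from collections import Counter
import math

def bigram_dist(sequences):
    counts = Counter(b for seq in sequences for b in zip(seq, seq[1:]))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q, eps=1e-8):
    # KL(p || q), with simple smoothing for bigrams unseen under q.
    return sum(pv * math.log(pv / q.get(k, eps)) for k, pv in p.items())

data_seqs = [["the", "cat", "sat"], ["the", "dog", "sat"]]
model_seqs = [["the", "cat", "sat"], ["the", "cat", "cat"]]
print(kl_divergence(bigram_dist(data_seqs), bigram_dist(model_seqs)))
```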
arXiv Detail & Related papers (2023-06-08T17:59:58Z) - Anomaly Detection of Time Series with Smoothness-Inducing Sequential
Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
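A minimal sketch of the general recipe described above, assuming PyTorch: a per-timestep Gaussian head whose negative log-likelihood serves as an anomaly score, plus a simple smoothness penalty on consecutive means. The penalty only hints at the smoothness-inducing idea and is not the SISVAE prior.

```python
# Minimal sketch (PyTorch assumed): a per-timestep Gaussian head whose NLL can be used
# as an anomaly score, plus a simple smoothness penalty on consecutive means.
# This illustrates the general idea only, not the SISVAE model.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, hidden_dim, x_dim):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, x_dim)
        self.log_var = nn.Linear(hidden_dim, x_dim)

    def forward(self, h):                       # h: (batch, time, hidden_dim)
        return self.mean(h), self.log_var(h)

def per_step_nll(x, mean, log_var):
    # Negative log-likelihood of each timestamp under N(mean, exp(log_var)),
    # up to an additive constant.
    return 0.5 * (log_var + (x - mean) ** 2 / log_var.exp()).sum(-1)

def smoothness_penalty(mean):
    # Penalize abrupt changes between the means of consecutive timestamps.
    return ((mean[:, 1:] - mean[:, :-1]) ** 2).mean()

h = torch.randn(8, 50, 32)                      # encoder states (toy example)
x = torch.randn(8, 50, 4)                       # observed series (toy example)
mean, log_var = GaussianHead(32, 4)(h)
loss = per_step_nll(x, mean, log_var).mean() + 0.1 * smoothness_penalty(mean)
anomaly_score = per_step_nll(x, mean, log_var)  # high score -> likely anomaly
```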
arXiv Detail & Related papers (2021-02-02T06:15:15Z) - On Maximum Likelihood Training of Score-Based Generative Models [17.05208572228308]
We show that the weighted score matching objective is equivalent to maximum likelihood for certain choices of mixture weighting.
We show that both maximum likelihood training and test-time log-likelihood evaluation can be achieved through parameterization of the score function alone.
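For context, the sketch below (PyTorch assumed, toy one-dimensional data) writes out the weighted denoising score matching loss commonly used to train score models; the weighting function lam is left generic, since relating particular weightings to log-likelihood is what the paper studies.

```python
# Sketch (PyTorch assumed): weighted denoising score matching for a score model
# s_theta(x, sigma). The weighting lam(sigma) is left generic; connecting specific
# weightings to (bounds on) log-likelihood is the subject of the paper above.
import torch
import torch.nn as nn

score_net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

def dsm_loss(x0, sigmas, lam=lambda s: s ** 2):
    sigma = sigmas[torch.randint(len(sigmas), (x0.shape[0], 1))]
    noise = torch.randn_like(x0)
    x_t = x0 + sigma * noise
    target = -(x_t - x0) / sigma ** 2            # score of q_sigma(x_t | x0)
    pred = score_net(torch.cat([x_t, sigma], dim=-1))
    return (lam(sigma) * (pred - target) ** 2).mean()

x0 = torch.randn(128, 1)                         # toy data
loss = dsm_loss(x0, sigmas=torch.tensor([0.1, 0.5, 1.0]))
loss.backward()
```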
arXiv Detail & Related papers (2021-01-22T18:22:29Z) - Autoregressive Score Matching [113.4502004812927]
We propose autoregressive conditional score models (AR-CSM), where we parameterize the joint distribution in terms of the derivatives of univariate log-conditionals (scores).
For AR-CSM models, the resulting score-matching divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training.
We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.
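To illustrate what a univariate conditional score is, the toy sketch below (PyTorch assumed, with a made-up autoregressive Gaussian model) uses automatic differentiation to obtain d log p(x_d | x_<d) / d x_d for each dimension; the AR-CSM training divergence itself is not reproduced here.

```python
# Illustration (PyTorch assumed): computing univariate conditional scores
# d log p(x_d | x_<d) / d x_d for a toy autoregressive Gaussian model via autograd.
import torch
import torch.nn as nn

D = 4
cond_net = nn.Linear(D, 2)                       # maps masked prefix -> (mean, log_std)

def conditional_log_prob(x, d):
    # log p(x_d | x_<d), up to a constant, with the prefix encoded by zero-masking
    # positions >= d so x_d only enters through the explicit term below.
    prefix = x * (torch.arange(D) < d).float()
    mean, log_std = cond_net(prefix).unbind(-1)
    return -0.5 * ((x[..., d] - mean) / log_std.exp()) ** 2 - log_std

x = torch.randn(8, D, requires_grad=True)
scores = []
for d in range(D):
    (grad,) = torch.autograd.grad(conditional_log_prob(x, d).sum(), x, create_graph=True)
    scores.append(grad[..., d])                  # keep only the d-th coordinate
scores = torch.stack(scores, dim=-1)             # (batch, D) conditional scores
```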
arXiv Detail & Related papers (2020-10-24T07:01:24Z) - Goal-directed Generation of Discrete Structures with Conditional
Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short python expressions which evaluate to a given target value.
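Below is a minimal REINFORCE-style sketch of directly optimizing an expected reward over discrete outputs, with a toy policy and a stand-in reward function rather than the molecule or Python-expression tasks from the paper (PyTorch assumed).

```python
# Minimal REINFORCE sketch (PyTorch assumed): maximize E[R(sequence)] for a toy
# categorical policy over tokens. The reward function here is a stand-in.
import torch
import torch.nn as nn

vocab_size, seq_len = 10, 5
logits = nn.Parameter(torch.zeros(seq_len, vocab_size))   # toy "policy"

def reward(seq):
    return (seq == 3).float().sum()              # stand-in reward: count of token 3

optimizer = torch.optim.Adam([logits], lr=0.1)
for _ in range(100):
    dist = torch.distributions.Categorical(logits=logits)
    seq = dist.sample()                          # one token per position
    log_prob = dist.log_prob(seq).sum()
    loss = -reward(seq) * log_prob               # REINFORCE: -R * log p(sequence)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```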
arXiv Detail & Related papers (2020-10-05T20:03:13Z) - Variational Mixture of Normalizing Flows [0.0]
Deep generative models, such as generative adversarial networks, variational autoencoders, and their variants, have seen wide adoption for the task of modelling complex data distributions.
Normalizing flows overcome the lack of a tractable, explicit density in such models by leveraging the change-of-variables formula for probability density functions.
The present work overcomes this by using normalizing flows as components in a mixture model and devising an end-to-end training procedure for such a model.
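For reference, the tiny sketch below shows the change-of-variables mechanic that individual flows rely on, using the simplest possible invertible map; it is not the mixture-of-flows model proposed in the paper.

```python
# Tiny change-of-variables example: an invertible affine map z = (x - b) / a gives
# log p_X(x) = log p_Z(z) + log |dz/dx| = log p_Z(z) - log|a|.
# This is the basic flow mechanic only, not the mixture-of-flows model above.
import math

def affine_flow_log_prob(x, a=2.0, b=1.0):
    z = (x - b) / a                                   # forward pass of the flow
    log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi))  # standard normal base density
    log_det = -math.log(abs(a))                       # log |dz/dx|
    return log_pz + log_det

print(affine_flow_log_prob(3.0))
```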
arXiv Detail & Related papers (2020-09-01T17:20:08Z) - Pattern Similarity-based Machine Learning Methods for Mid-term Load
Forecasting: A Comparative Study [0.0]
We use pattern similarity-based methods for forecasting monthly electricity demand exhibiting annual seasonality.
An integral part of the models is the time series representation using patterns of time series sequences.
We consider four such models: nearest neighbor model, fuzzy neighborhood model, kernel regression model and general regression neural network.
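Below is a rough sketch of the nearest neighbor variant under simplifying assumptions (NumPy, random toy data): each year of monthly demand is normalized into a pattern, the most similar historical years are found, and the patterns of their following years are averaged as the forecast shape.

```python
# Rough nearest-neighbour sketch of pattern-similarity forecasting: normalize each
# yearly pattern, find the k most similar historical years, and average the patterns
# that followed them. Toy data; the models compared in the paper are more elaborate.
import numpy as np

def normalize(pattern):
    return (pattern - pattern.mean()) / pattern.std()

def knn_forecast(history, query_year, k=2):
    # history: (n_years, 12) monthly demand; returns the forecast pattern (shape only,
    # to be rescaled back to a demand level) for the year after query_year.
    patterns = np.array([normalize(y) for y in history[:-1]])
    query = normalize(query_year)
    nearest = np.argsort(((patterns - query) ** 2).sum(axis=1))[:k]
    return np.mean([normalize(history[i + 1]) for i in nearest], axis=0)

history = np.random.rand(10, 12) * 100 + np.tile(np.sin(np.linspace(0, 2 * np.pi, 12)) * 20, (10, 1))
forecast_pattern = knn_forecast(history, history[-1])
```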
arXiv Detail & Related papers (2020-03-03T12:14:36Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
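As a loose illustration of scoring dullness from output probabilities alone, the sketch below (PyTorch assumed) averages per-step output distributions over a batch and takes the entropy of that average as a diversity score; the paper's AvgOut-based scores differ in detail.

```python
# Loose illustration: average the model's output distributions over a batch and use
# the entropy of that average as a diversity score (a dull model concentrates its
# averaged distribution on a few generic tokens). Details differ from the paper.
import torch

def diversity_score(probs):
    # probs: (batch, seq_len, vocab) per-step output distributions.
    avg = probs.mean(dim=(0, 1))                 # batch-averaged token distribution
    return -(avg * (avg + 1e-12).log()).sum()    # entropy of the averaged distribution

probs = torch.softmax(torch.randn(16, 20, 1000), dim=-1)
print(diversity_score(probs))
```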
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.