Tailoring Language Generation Models under Total Variation Distance
- URL: http://arxiv.org/abs/2302.13344v1
- Date: Sun, 26 Feb 2023 16:32:52 GMT
- Title: Tailoring Language Generation Models under Total Variation Distance
- Authors: Haozhe Ji, Pei Ke, Zhipeng Hu, Rongsheng Zhang, Minlie Huang
- Abstract summary: The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We leverage the total variation distance (TVD) for its robustness to outliers and develop practical bounds to apply it to language generation.
We introduce the TaiLr objective, which balances the tradeoff in estimating TVD.
- Score: 55.89964205594829
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The standard paradigm of neural language generation adopts maximum likelihood
estimation (MLE) as the optimizing method. From a distributional view, MLE in
fact minimizes the Kullback-Leibler divergence (KLD) between the distribution
of the real data and that of the model. However, this approach forces the model
to distribute non-zero (sometimes large) probability mass to all training
samples regardless of their quality. Moreover, in the attempt to cover the
low-probability regions in the data distribution, the model systematically
overestimates the probability of corrupted text sequences, which we conjecture
is one of the main reasons for text degeneration during autoregressive
decoding. To remedy this problem, we leverage the total variation distance
(TVD) with its robustness to outliers, and develop practical bounds to apply it
to language generation. Then, we introduce the TaiLr objective that balances
the tradeoff of estimating TVD. Intuitively, TaiLr downweights real data
samples that have low model probabilities with tunable penalization intensity.
Experimental results show that our method alleviates the overestimation of
degenerated sequences without sacrificing diversity and improves generation
quality on a wide range of text generation tasks.
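For concreteness, the total variation distance between the data distribution p and the model distribution q is TVD(p, q) = 1/2 * sum_x |p(x) - q(x)|, so each sample's contribution to the objective is bounded, unlike in the KLD. The PyTorch sketch below illustrates one way a TaiLr-style objective could downweight reference tokens that receive low model probability; the interpolation weight p / (gamma + (1 - gamma) * p), the function name tailr_style_loss, and the default gamma are illustrative assumptions based on the abstract, not the paper's exact formulation.

```python
# Minimal sketch of a TaiLr-style downweighted cross-entropy loss (PyTorch).
# Assumption: the per-token weight interpolates between plain MLE (weight 1)
# and a TVD-motivated weight equal to the model probability, controlled by gamma.
import torch
import torch.nn.functional as F


def tailr_style_loss(logits: torch.Tensor,
                     targets: torch.Tensor,
                     gamma: float = 0.1) -> torch.Tensor:
    """logits: (batch, seq_len, vocab_size); targets: (batch, seq_len) token ids."""
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability the model assigns to each reference token.
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    tok_p = tok_logp.exp()
    # Detached weight: gamma = 0 gives weight 1 (standard MLE); gamma = 1 gives
    # weight = tok_p, which downweights reference tokens the model finds unlikely.
    weight = (tok_p / (gamma + (1.0 - gamma) * tok_p)).detach()
    return -(weight * tok_logp).mean()


if __name__ == "__main__":
    # Random tensors stand in for a language model's outputs and reference tokens.
    logits = torch.randn(2, 5, 100, requires_grad=True)
    targets = torch.randint(0, 100, (2, 5))
    loss = tailr_style_loss(logits, targets, gamma=0.1)
    loss.backward()
    print(float(loss))
```

In this sketch, gamma acts as the tunable penalization intensity mentioned in the abstract: gamma = 0 recovers standard MLE, while larger gamma increasingly suppresses the gradient from reference tokens the model assigns low probability.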
Related papers
- Robust training of implicit generative models for multivariate and heavy-tailed distributions with an invariant statistical loss [0.4249842620609682]
We build on the invariant statistical loss (ISL) method introduced in earlier work (de2024training).
We extend it to handle heavy-tailed and multivariate data distributions.
We assess its performance in generative modeling and explore its potential as a pretraining technique for generative adversarial networks (GANs).
arXiv Detail & Related papers (2024-10-29T10:27:50Z) - Diffusion models for probabilistic programming [56.47577824219207]
Diffusion Model Variational Inference (DMVI) is a novel method for automated approximate inference in probabilistic programming languages (PPLs).
DMVI is easy to implement, allows hassle-free inference in PPLs without the drawbacks of, e.g., variational inference using normalizing flows, and does not impose any constraints on the underlying neural network model.
arXiv Detail & Related papers (2023-11-01T12:17:05Z) - Beyond MLE: Convex Learning for Text Generation [34.99340118597274]
We argue that maximum likelihood estimation (MLE) is not always necessary or optimal, especially for closed-ended text generation tasks like machine translation.
We propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution.
arXiv Detail & Related papers (2023-10-26T08:08:43Z) - Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution [67.9215891673174]
We propose score entropy as a novel loss that naturally extends score matching to discrete spaces.
We test our Score Entropy Discrete Diffusion models on standard language modeling tasks.
arXiv Detail & Related papers (2023-10-25T17:59:12Z) - Learning Multivariate CDFs and Copulas using Tensor Factorization [39.24470798045442]
Learning the multivariate distribution of data is a core challenge in statistics and machine learning.
In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they can handle mixed random variables.
We show that any grid sampled version of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model.
We demonstrate the superior performance of the proposed model on several synthetic and real datasets and in applications including regression, sampling, and data imputation.
arXiv Detail & Related papers (2022-10-13T16:18:46Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Evaluating Distributional Distortion in Neural Language Modeling [81.83408583979745]
A heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language.
Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate.
We develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages.
arXiv Detail & Related papers (2022-03-24T01:09:46Z) - Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating in the large sample spaces encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.