Beyond MLE: Convex Learning for Text Generation
- URL: http://arxiv.org/abs/2310.17217v1
- Date: Thu, 26 Oct 2023 08:08:43 GMT
- Title: Beyond MLE: Convex Learning for Text Generation
- Authors: Chenze Shao and Zhengrui Ma and Min Zhang and Yang Feng
- Abstract summary: We argue that maximum likelihood estimation (MLE) is not always necessary or optimal, especially for closed-ended text generation tasks like machine translation.
We propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution.
- Score: 34.99340118597274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maximum likelihood estimation (MLE) is a statistical method used to estimate
the parameters of a probability distribution that best explain the observed
data. In the context of text generation, MLE is often used to train generative
language models, which can then be used to generate new text. However, we argue
that MLE is not always necessary or optimal, especially for closed-ended text
generation tasks like machine translation. In these tasks, the goal of the model is
to generate the most appropriate response, which does not necessarily require
it to estimate the entire data distribution with MLE. To this end, we propose a
novel class of training objectives based on convex functions, which enables
text generation models to focus on highly probable outputs without having to
estimate the entire data distribution. We investigate the theoretical
properties of the optimal predicted distribution when applying convex functions
to the loss, demonstrating that convex functions can sharpen the optimal
distribution, thereby enabling the model to better capture outputs with high
probabilities. Experiments on various text generation tasks and models show the
effectiveness of our approach. It enables autoregressive models to bridge the
gap between greedy and beam search, and facilitates the learning of
non-autoregressive models with a maximum improvement of 9+ BLEU points.
Moreover, our approach also has a significant impact on large language
models (LLMs), substantially enhancing their generative capability on various
tasks. Source code is available at
https://github.com/ictnlp/Convex-Learning.
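The abstract describes replacing the concave log of MLE with a convex function so that training concentrates on highly probable outputs rather than estimating the entire data distribution. The snippet below is a minimal, hypothetical sketch of that idea: it contrasts standard cross-entropy (MLE) with an illustrative convex objective built from the power function p**gamma (gamma >= 1) on target-token probabilities. The function names, the power-function choice, and the masked averaging are assumptions for illustration, not the paper's exact formulation; see the linked repository for the authors' implementation.
```python
# Hypothetical sketch contrasting token-level MLE with a convex-function objective.
# The power function f(p) = p**gamma (gamma >= 1) is one illustrative convex choice;
# it is not necessarily the exact objective used in the paper.
import torch
import torch.nn.functional as F


def mle_loss(logits, targets, pad_id=0):
    """Standard MLE: token-level cross-entropy, i.e. minimize -log p(target)."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=pad_id
    )


def convex_loss(logits, targets, gamma=2.0, pad_id=0):
    """Illustrative convex objective: minimize -f(p(target)) with f(p) = p**gamma.

    The gradient of -p**gamma w.r.t. log p is -gamma * p**gamma, so tokens the model
    already finds probable dominate the update, which sharpens the optimal predicted
    distribution instead of matching the full data distribution.
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (B, T, V)
    target_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
    mask = (targets != pad_id).float()
    # p**gamma computed as exp(gamma * log p) for numerical stability.
    return -(torch.exp(gamma * target_logp) * mask).sum() / mask.sum()


if __name__ == "__main__":
    batch, length, vocab = 2, 5, 100
    logits = torch.randn(batch, length, vocab, requires_grad=True)
    targets = torch.randint(1, vocab, (batch, length))
    print("MLE loss:", mle_loss(logits, targets).item())
    print("Convex loss:", convex_loss(logits, targets).item())
```
With gamma = 1 the objective reduces to the negative target probability itself; larger gamma concentrates training further on outputs the model already ranks highly, whereas MLE uses the concave log instead.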
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply it to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD (total variation distance); a brief TVD sketch follows this list.
arXiv Detail & Related papers (2023-02-26T16:32:52Z)
- LaDDer: Latent Data Distribution Modelling with a Generative Prior [21.27563489899532]
We propose LaDDer to achieve accurate modelling of the latent data distribution in a variational autoencoder framework.
LaDDer is a meta-embedding concept, which uses multiple VAE models to learn an embedding of the embeddings.
We show that our LaDDer model is able to accurately estimate complex latent distribution and results in improvement in the representation quality.
arXiv Detail & Related papers (2020-08-31T20:10:01Z)
- Expected Information Maximization: Using the I-Projection for Mixture Density Estimation [22.096148237257644]
Modelling highly multi-modal data is a challenging problem in machine learning.
We present a new algorithm called Expected Information Maximization (EIM) for computing the I-projection.
We show that our algorithm is much more effective in computing the I-projection than recent GAN approaches.
arXiv Detail & Related papers (2020-01-23T17:24:50Z)
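For reference, the TaiLr entry above targets total variation distance (TVD) rather than the KL divergence implicit in MLE. The snippet below is a small, self-contained sketch of token-level TVD against one-hot targets; it illustrates the quantity itself and is not the TaiLr objective, which relies on the practical bounds described in that paper. The function name is an assumption for illustration.
```python
# Hypothetical illustration of token-level total variation distance (TVD),
# TVD(P, Q) = 0.5 * sum_x |P(x) - Q(x)|, with the data distribution taken as
# one-hot targets. This is not the TaiLr training objective itself.
import torch
import torch.nn.functional as F


def token_tvd(logits, targets):
    """Per-position TVD between one-hot targets and the model's softmax distribution."""
    probs = F.softmax(logits, dim=-1)                                  # (B, T, V)
    onehot = F.one_hot(targets, num_classes=probs.size(-1)).float()    # (B, T, V)
    return 0.5 * (onehot - probs).abs().sum(dim=-1)                    # (B, T)


logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
print("mean token-level TVD:", token_tvd(logits, targets).mean().item())
```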
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.