Beyond MLE: Convex Learning for Text Generation
- URL: http://arxiv.org/abs/2310.17217v1
- Date: Thu, 26 Oct 2023 08:08:43 GMT
- Title: Beyond MLE: Convex Learning for Text Generation
- Authors: Chenze Shao and Zhengrui Ma and Min Zhang and Yang Feng
- Abstract summary: We argue that Maximum likelihood estimation (MLE) is not always necessary and optimal, especially for closed-ended text generation tasks like machine translation.
We propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution.
- Score: 34.99340118597274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maximum likelihood estimation (MLE) is a statistical method used to estimate
the parameters of a probability distribution that best explain the observed
data. In the context of text generation, MLE is often used to train generative
language models, which can then be used to generate new text. However, we argue
that MLE is not always necessary and optimal, especially for closed-ended text
generation tasks like machine translation. In these tasks, the goal of model is
to generate the most appropriate response, which does not necessarily require
it to estimate the entire data distribution with MLE. To this end, we propose a
novel class of training objectives based on convex functions, which enables
text generation models to focus on highly probable outputs without having to
estimate the entire data distribution. We investigate the theoretical
properties of the optimal predicted distribution when applying convex functions
to the loss, demonstrating that convex functions can sharpen the optimal
distribution, thereby enabling the model to better capture outputs with high
probabilities. Experiments on various text generation tasks and models show the
effectiveness of our approach. It enables autoregressive models to bridge the
gap between greedy and beam search, and facilitates the learning of
non-autoregressive models with a maximum improvement of 9+ BLEU points.
Moreover, our approach also exhibits significant impact on large language
models (LLMs), substantially enhancing their generative capability on various
tasks. Source code is available at
\url{https://github.com/ictnlp/Convex-Learning}.
Related papers
- Make Prompts Adaptable: Bayesian Modeling for Vision-Language Prompt
Learning with Data-Dependent Prior [14.232144691524528]
Recent Vision-Language Pretrained models have become the backbone for many downstream tasks.
MLE training can lead the context vector to over-fit dominant image features in the training data.
This paper presents a Bayesian-based framework of prompt learning, which could alleviate the overfitting issues on few-shot learning application.
arXiv Detail & Related papers (2024-01-09T10:15:59Z) - A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs.
arXiv Detail & Related papers (2023-11-06T13:11:02Z) - RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder
for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE)
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z) - Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply it to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z) - LaDDer: Latent Data Distribution Modelling with a Generative Prior [21.27563489899532]
We propose LaDDer to achieve accurate modelling of the latent data distribution in a variational autoencoder framework.
LaDDer is a meta-embedding concept, which uses multiple VAE models to learn an embedding of the embeddings.
We show that our LaDDer model is able to accurately estimate complex latent distribution and results in improvement in the representation quality.
arXiv Detail & Related papers (2020-08-31T20:10:01Z) - Improving Maximum Likelihood Training for Text Generation with Density
Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating at large sample space encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z) - Expected Information Maximization: Using the I-Projection for Mixture
Density Estimation [22.096148237257644]
Modelling highly multi-modal data is a challenging problem in machine learning.
We present a new algorithm called Expected Information Maximization (EIM) for computing the I-projection.
We show that our algorithm is much more effective in computing the I-projection than recent GAN approaches.
arXiv Detail & Related papers (2020-01-23T17:24:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.