Beyond MLE: Convex Learning for Text Generation
- URL: http://arxiv.org/abs/2310.17217v1
- Date: Thu, 26 Oct 2023 08:08:43 GMT
- Title: Beyond MLE: Convex Learning for Text Generation
- Authors: Chenze Shao and Zhengrui Ma and Min Zhang and Yang Feng
- Abstract summary: We argue that maximum likelihood estimation (MLE) is not always necessary or optimal, especially for closed-ended text generation tasks like machine translation.
We propose a novel class of training objectives based on convex functions, which enables text generation models to focus on highly probable outputs without having to estimate the entire data distribution.
- Score: 34.99340118597274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maximum likelihood estimation (MLE) is a statistical method used to estimate
the parameters of a probability distribution that best explain the observed
data. In the context of text generation, MLE is often used to train generative
language models, which can then be used to generate new text. However, we argue
that MLE is not always necessary or optimal, especially for closed-ended text
generation tasks like machine translation. In these tasks, the goal of the model is
to generate the most appropriate response, which does not necessarily require
it to estimate the entire data distribution with MLE. To this end, we propose a
novel class of training objectives based on convex functions, which enables
text generation models to focus on highly probable outputs without having to
estimate the entire data distribution. We investigate the theoretical
properties of the optimal predicted distribution when applying convex functions
to the loss, demonstrating that convex functions can sharpen the optimal
distribution, thereby enabling the model to better capture outputs with high
probabilities. Experiments on various text generation tasks and models show the
effectiveness of our approach. It enables autoregressive models to bridge the
gap between greedy and beam search, and facilitates the learning of
non-autoregressive models with a maximum improvement of 9+ BLEU points.
Moreover, our approach also has a significant impact on large language
models (LLMs), substantially enhancing their generative capability on various
tasks. Source code is available at
https://github.com/ictnlp/Convex-Learning.
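The abstract describes replacing the concave log of MLE with a convex function so that training concentrates on highly probable outputs rather than estimating the entire data distribution. The snippet below is a minimal, hypothetical sketch of that idea: it contrasts standard cross-entropy (MLE) with an illustrative convex objective built from the power function p**gamma (gamma >= 1) on target-token probabilities. The function names, the power-function choice, and the masked averaging are assumptions for illustration, not the paper's exact formulation; see the linked repository for the authors' implementation.
```python
# Hypothetical sketch contrasting token-level MLE with a convex-function objective.
# The power function f(p) = p**gamma (gamma >= 1) is one illustrative convex choice;
# it is not necessarily the exact objective used in the paper.
import torch
import torch.nn.functional as F


def mle_loss(logits, targets, pad_id=0):
    """Standard MLE: token-level cross-entropy, i.e. minimize -log p(target)."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=pad_id
    )


def convex_loss(logits, targets, gamma=2.0, pad_id=0):
    """Illustrative convex objective: minimize -f(p(target)) with f(p) = p**gamma.

    The gradient of -p**gamma w.r.t. log p is -gamma * p**gamma, so tokens the model
    already finds probable dominate the update, which sharpens the optimal predicted
    distribution instead of matching the full data distribution.
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (B, T, V)
    target_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # (B, T)
    mask = (targets != pad_id).float()
    # p**gamma computed as exp(gamma * log p) for numerical stability.
    return -(torch.exp(gamma * target_logp) * mask).sum() / mask.sum()


if __name__ == "__main__":
    batch, length, vocab = 2, 5, 100
    logits = torch.randn(batch, length, vocab, requires_grad=True)
    targets = torch.randint(1, vocab, (batch, length))
    print("MLE loss:", mle_loss(logits, targets).item())
    print("Convex loss:", convex_loss(logits, targets).item())
```
With gamma = 1 the objective reduces to the negative target probability itself; larger gamma concentrates training further on outputs the model already ranks highly, whereas MLE uses the concave log instead.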
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
- Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners.
We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)
- Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply it to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD (total variation distance); a brief TVD sketch follows this list.
arXiv Detail & Related papers (2023-02-26T16:32:52Z)
- LaDDer: Latent Data Distribution Modelling with a Generative Prior [21.27563489899532]
We propose LaDDer to achieve accurate modelling of the latent data distribution in a variational autoencoder framework.
LaDDer is a meta-embedding concept, which uses multiple VAE models to learn an embedding of the embeddings.
We show that our LaDDer model is able to accurately estimate complex latent distribution and results in improvement in the representation quality.
arXiv Detail & Related papers (2020-08-31T20:10:01Z)
- Expected Information Maximization: Using the I-Projection for Mixture Density Estimation [22.096148237257644]
Modelling highly multi-modal data is a challenging problem in machine learning.
We present a new algorithm called Expected Information Maximization (EIM) for computing the I-projection.
We show that our algorithm is much more effective in computing the I-projection than recent GAN approaches.
arXiv Detail & Related papers (2020-01-23T17:24:50Z)
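For reference, the TaiLr entry above targets total variation distance (TVD) rather than the KL divergence implicit in MLE. The snippet below is a small, self-contained sketch of token-level TVD against one-hot targets; it illustrates the quantity itself and is not the TaiLr objective, which relies on the practical bounds described in that paper. The function name is an assumption for illustration.
```python
# Hypothetical illustration of token-level total variation distance (TVD),
# TVD(P, Q) = 0.5 * sum_x |P(x) - Q(x)|, with the data distribution taken as
# one-hot targets. This is not the TaiLr training objective itself.
import torch
import torch.nn.functional as F


def token_tvd(logits, targets):
    """Per-position TVD between one-hot targets and the model's softmax distribution."""
    probs = F.softmax(logits, dim=-1)                                  # (B, T, V)
    onehot = F.one_hot(targets, num_classes=probs.size(-1)).float()    # (B, T, V)
    return 0.5 * (onehot - probs).abs().sum(dim=-1)                    # (B, T)


logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
print("mean token-level TVD:", token_tvd(logits, targets).mean().item())
```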
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.