Exposing the Implicit Energy Networks behind Masked Language Models via
Metropolis--Hastings
- URL: http://arxiv.org/abs/2106.02736v1
- Date: Fri, 4 Jun 2021 22:04:30 GMT
- Title: Exposing the Implicit Energy Networks behind Masked Language Models via
Metropolis--Hastings
- Authors: Kartik Goyal, Chris Dyer, Taylor Berg-Kirkpatrick
- Abstract summary: We interpret MLMs as energy-based sequence models and propose two energy parametrizations derivable from the trained MLMs.
We develop a tractable sampling scheme based on the Metropolis--Hastings Monte Carlo algorithm.
We validate the effectiveness of the proposed parametrizations by exploring the quality of samples drawn from these energy-based models.
- Score: 57.133639209759615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent work has shown that scores from models trained by the ubiquitous
masked language modeling (MLM) objective effectively discriminate probable and
improbable sequences, it is still an open question if these MLMs specify a
principled probability distribution over the space of possible sequences. In
this paper, we interpret MLMs as energy-based sequence models and propose two
energy parametrizations derivable from the trained MLMs. In order to draw
samples correctly from these models, we develop a tractable sampling
scheme based on the Metropolis--Hastings Monte Carlo algorithm. In our
approach, samples are proposed from the same masked conditionals used for
training the masked language models, and they are accepted or rejected based on
their energy values according to the target distribution. We validate the
effectiveness of the proposed parametrizations by exploring the quality of
samples drawn from these energy-based models on the conditional generation task
of machine translation. We theoretically and empirically justify our sampling
algorithm by showing that the masked conditionals on their own do not yield a
Markov chain whose stationary distribution is that of our target distribution,
and our approach generates higher quality samples than other recently proposed
undirected generation approaches (Wang et al., 2019, Ghazvininejad et al.,
2019).
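To make the sampling scheme concrete, below is a minimal sketch of one Metropolis--Hastings step driven by an MLM's masked conditionals. It assumes a Hugging Face bert-base-uncased checkpoint, and the energy used (negative sum of per-token log-scores from a single unmasked forward pass) is one simple illustrative parametrization, not necessarily either of the two proposed in the paper. Because the forward and reverse proposals mask the same position, one masked conditional scores both, so the acceptance ratio min(1, e^{-E(x')} q(x|x') / (e^{-E(x)} q(x'|x))) is cheap to evaluate.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def energy(ids):
    # One simple energy: negative sum of per-token log-scores from a
    # single unmasked forward pass (illustrative choice only).
    logp = mlm(input_ids=ids.unsqueeze(0)).logits[0].log_softmax(-1)
    return -logp.gather(-1, ids.unsqueeze(-1)).sum()

@torch.no_grad()
def masked_conditional(ids, pos):
    # The MLM conditional p(. | x_{-pos}), obtained by masking position pos.
    masked = ids.clone()
    masked[pos] = tok.mask_token_id
    return mlm(input_ids=masked.unsqueeze(0)).logits[0, pos].log_softmax(-1)

@torch.no_grad()
def mh_step(ids):
    pos = torch.randint(1, len(ids) - 1, (1,)).item()  # skip [CLS]/[SEP]
    log_q = masked_conditional(ids, pos)               # proposal distribution
    new_tok = torch.multinomial(log_q.exp(), 1).item()
    prop = ids.clone()
    prop[pos] = new_tok
    # Masking pos yields the same context for x and x', so the same
    # conditional scores both the forward and reverse proposals.
    log_alpha = (energy(ids) - energy(prop)            # exp(-E(x')) / exp(-E(x))
                 + log_q[ids[pos]] - log_q[new_tok])   # q(x|x') / q(x'|x)
    accept = torch.rand(()).log() < log_alpha.clamp(max=0.0)
    return prop if accept else ids

ids = tok("the cat sat on the mat", return_tensors="pt").input_ids[0]
for _ in range(200):
    ids = mh_step(ids)
print(tok.decode(ids, skip_special_tokens=True))
```

Dropping the accept/reject test reduces this chain to Gibbs-style resampling from the masked conditionals which, as the abstract notes, does not have the target distribution as its stationary distribution.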
Related papers
- Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference.
Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable.
We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z) - Towards Probabilistically-Sound Beam Search with Masked Language Models [0.0]
Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not available.
Estimating such distributions has important domain-specific applications such as ancient text restoration and protein engineering.
Here we present probabilistically-sound methods for beam search with MLMs.
arXiv Detail & Related papers (2024-02-22T23:36:26Z) - A Block Metropolis-Hastings Sampler for Controllable Energy-based Text
Generation [78.81021361497311]
We develop a novel Metropolis-Hastings (MH) sampler that proposes re-writes of the entire sequence in each step via iterative prompting of a large language model.
Our new sampler (a) allows for more efficient and accurate sampling from a target distribution and (b) allows generation length to be determined through the sampling procedure rather than being fixed in advance.
arXiv Detail & Related papers (2023-12-07T18:30:15Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Differentiating Metropolis-Hastings to Optimize Intractable Densities [51.16801956665228]
We develop an algorithm for automatic differentiation of Metropolis-Hastings samplers.
We apply gradient-based optimization to objectives expressed as expectations over intractable target densities.
arXiv Detail & Related papers (2023-06-13T17:56:02Z) - Deriving Language Models from Masked Language Models [12.628196757545979]
Masked language models (MLMs) do not explicitly define a distribution over language.
Recent work has implicitly treated them as such for the purposes of generation and scoring.
arXiv Detail & Related papers (2023-05-24T18:42:45Z) - Inconsistencies in Masked Language Models [20.320583166619528]
Masked language models (MLMs) can provide distributions of tokens in the masked positions in a sequence.
Distributions corresponding to different masking patterns can demonstrate considerable inconsistencies.
We propose an inference-time strategy for MLMs called Ensemble of Conditionals.
arXiv Detail & Related papers (2022-12-30T22:53:25Z) - Sampling from Discrete Energy-Based Models with Quality/Efficiency
Trade-offs [3.491202838583993]
Energy-Based Models (EBMs) allow for extremely flexible specifications of probability distributions.
They do not provide a mechanism for obtaining exact samples from these distributions.
We propose a new approximate sampling technique, Quasi Rejection Sampling (QRS), that allows for a trade-off between sampling efficiency and sampling quality; a minimal sketch appears after this list.
arXiv Detail & Related papers (2021-12-10T17:51:37Z) - Oops I Took A Gradient: Scalable Sampling for Discrete Distributions [53.3142984019796]
We show that this gradient-based approach outperforms generic samplers in a number of difficult settings.
We also demonstrate the use of our improved sampler for training deep energy-based models on high dimensional discrete data.
arXiv Detail & Related papers (2021-02-08T20:08:50Z)
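As promised in the Quasi Rejection Sampling entry above, here is a minimal sketch of QRS. The callables propose and log_p are placeholders for a proposal sampler (returning a sample and its log-probability) and an unnormalized target log-score; neither name comes from the paper's codebase. The single knob beta trades acceptance rate for sample quality: raising beta improves quality, and once beta upper-bounds P(x)/q(x) everywhere, the accepted samples are exact draws from the target.

```python
import math
import random

def qrs_sample(propose, log_p, beta, n_draws):
    # Quasi Rejection Sampling: accept x ~ q with probability
    # min(1, P(x) / (beta * q(x))). Larger beta -> higher quality,
    # lower acceptance rate; exact once beta bounds P/q everywhere.
    accepted = []
    for _ in range(n_draws):
        x, log_q = propose()
        log_accept = min(0.0, log_p(x) - math.log(beta) - log_q)
        if random.random() < math.exp(log_accept):
            accepted.append(x)
    return accepted

# Toy usage: unnormalized Gaussian-shaped target on {0, ..., 9} with a
# uniform proposal; beta = 20 upper-bounds P/q, so samples are exact.
propose = lambda: (random.randrange(10), math.log(0.1))
log_p = lambda x: -((x - 3) ** 2)
samples = qrs_sample(propose, log_p, beta=20.0, n_draws=10_000)
print(f"acceptance rate {len(samples) / 10_000:.3f}")
```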