Towards Probabilistically-Sound Beam Search with Masked Language Models
- URL: http://arxiv.org/abs/2402.15020v3
- Date: Thu, 10 Oct 2024 06:34:02 GMT
- Title: Towards Probabilistically-Sound Beam Search with Masked Language Models
- Authors: Creston Brooks, Robert Calef, Charlie Cowen-Breen, Anna Sappington,
- Abstract summary: Beam search masked language models (MLMs) is challenging in part because joint probability over distributions are not available.
estimating such distributions has important domain-specific applications such as ancient text restoration and protein engineering.
Here we present probabilistically-sound methods for beam search with sequences.
- Score: 0.0
- License:
- Abstract: Beam search with masked language models (MLMs) is challenging in part because joint probability distributions over sequences are not readily available, unlike for autoregressive models. However, estimating such distributions has important domain-specific applications such as ancient text restoration and protein engineering. Here we present probabilistically-sound methods for beam search with MLMs. First, we clarify the conditions under which it is theoretically sound to perform text infilling with MLMs using standard beam search. When these conditions fail, we provide a probabilistically-sound inference time modification with no additional computational complexity and demonstrate that it is superior to the aforementioned beam search in the expected conditions. We then present empirical results comparing several infilling approaches with MLMs across several domains. Notably, our method probes the inductive biases of MLMs and explores the surprising contextual sensitivity of mask tokens for text infilling.
Related papers
- ExLM: Rethinking the Impact of [MASK] Tokens in Masked Language Models [11.997499811414837]
Masked Language Models (ML)Mss are trained by randomly masking portions of the input sequences with [MASK] tokens and learning to reconstruct the original content based on the remaining context.
arXiv Detail & Related papers (2025-01-23T05:46:50Z) - Uncertainty-Aware Hybrid Inference with On-Device Small and Remote Large Language Models [49.48313161005423]
A hybrid language model (HLM) architecture integrates a small language model (SLM) operating on a mobile device with a large language model (LLM) hosted at the base station (BS) of a wireless network.
The HLM token generation process follows the speculative inference principle: the SLM's vocabulary distribution is uploaded to the LLM, which either accepts or rejects it, with rejected tokens being resampled by the LLM.
We propose a novel HLM structure coined Uncertainty-aware opportunistic HLM (U-HLM), wherein the SLM locally measures its output uncertainty and skips both up
arXiv Detail & Related papers (2024-12-17T09:08:18Z) - Forking Paths in Neural Text Generation [14.75166317633176]
We develop a novel approach to representing uncertainty dynamics across individual tokens of text generation.
We use our method to analyze LLM responses on 7 different tasks across 4 domains.
We find many examples of forking tokens, including surprising ones such as punctuation marks.
arXiv Detail & Related papers (2024-12-10T22:57:57Z) - Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation [73.58618024960968]
An increasing number of studies are employing large language models (LLMs) as agents to emulate the sequential decision-making processes of humans.
This arouses curiosity regarding the capacity of LLM agents to comprehend probability distributions.
Our analysis indicates that LLM agents can understand probabilities, but they struggle with probability sampling.
arXiv Detail & Related papers (2024-04-13T16:59:28Z) - Which Syntactic Capabilities Are Statistically Learned by Masked
Language Models for Code? [51.29970742152668]
We highlight relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval in Syntactic Capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z) - Deriving Language Models from Masked Language Models [12.628196757545979]
Masked language models (MLM) do not explicitly define a distribution over language.
Recent work has implicitly treated them as such for the purposes of generation and scoring.
arXiv Detail & Related papers (2023-05-24T18:42:45Z) - Inconsistencies in Masked Language Models [20.320583166619528]
Masked language models (MLMs) can provide distributions of tokens in the masked positions in a sequence.
distributions corresponding to different masking patterns can demonstrate considerable inconsistencies.
We propose an inference-time strategy for fors called Ensemble of Conditionals.
arXiv Detail & Related papers (2022-12-30T22:53:25Z) - Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models [65.52639709094963]
Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize.
We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model.
arXiv Detail & Related papers (2022-10-18T22:19:41Z) - Exposing the Implicit Energy Networks behind Masked Language Models via
Metropolis--Hastings [57.133639209759615]
We interpret sequences as energy-based sequence models and propose two energy parametrizations derivable from traineds.
We develop a tractable emph scheme based on the Metropolis-Hastings Monte Carlo algorithm.
We validate the effectiveness of the proposed parametrizations by exploring the quality of samples drawn from these energy-based models.
arXiv Detail & Related papers (2021-06-04T22:04:30Z) - Masked Language Modeling and the Distributional Hypothesis: Order Word
Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: pre-trains succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.