Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM
- URL: http://arxiv.org/abs/2411.01610v1
- Date: Sun, 03 Nov 2024 15:31:44 GMT
- Title: Explaining and Improving Contrastive Decoding by Extrapolating the Probabilities of a Huge and Hypothetical LM
- Authors: Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, Tagyoung Chung
- Abstract summary: Contrastive decoding (CD) improves the next-token distribution of a large expert language model (LM) using a small amateur LM.
We propose a new unsupervised decoding method called $\mathbf{A}$symptotic $\mathbf{P}$robability $\mathbf{D}$ecoding (APD).
APD explicitly extrapolates the probability curves from LMs of different sizes to infer the asymptotic probabilities from an infinitely large LM without incurring more inference cost than CD.
- Score: 93.8400683020273
- Abstract: Contrastive decoding (CD) (Li et al., 2023) improves the next-token distribution of a large expert language model (LM) using a small amateur LM. Although CD is applied to various LMs and domains to enhance open-ended text generation, it is still unclear why CD often works well, when it could fail, and how we can make it better. To deepen our understanding of CD, we first theoretically prove that CD could be viewed as linearly extrapolating the next-token logits from a huge and hypothetical LM. We also highlight that the linear extrapolation could make CD unable to output the most obvious answers that have already been assigned high probabilities by the amateur LM. To overcome CD's limitation, we propose a new unsupervised decoding method called $\mathbf{A}$symptotic $\mathbf{P}$robability $\mathbf{D}$ecoding (APD). APD explicitly extrapolates the probability curves from the LMs of different sizes to infer the asymptotic probabilities from an infinitely large LM without inducing more inference costs than CD. In FactualityPrompts, an open-ended text generation benchmark, sampling using APD significantly boosts factuality in comparison to the CD sampling and its variants, and achieves state-of-the-art results for Pythia 6.9B and OPT 6.7B. Furthermore, in five commonsense QA datasets, APD is often significantly better than CD and achieves a similar effect of using a larger LLM. For example, the perplexity of APD on top of Pythia 6.9B is even lower than the perplexity of Pythia 12B in CommonsenseQA and LAMBADA.
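The abstract's two key ideas, CD as a linear extrapolation of logits and APD as an extrapolation of per-token probability curves across model sizes, can be made concrete with a short sketch. The NumPy code below is a minimal illustration under stated assumptions: the function names are hypothetical, `beta` plays the role of CD's usual strength hyperparameter, and the curve fitted in `apd_sketch` (linear in 1/size) is an illustrative stand-in, not the paper's exact parametrization.

```python
import numpy as np

def cd_as_logit_extrapolation(logits_expert, logits_amateur, beta=0.5):
    """CD rewritten as linear logit extrapolation (the paper's first result).

    The CD score (1 + beta) * log p_expert - beta * log p_amateur equals,
    up to a per-position constant absorbed by the softmax, the line through
    the amateur's and expert's logits extended beta steps further, i.e.
    toward a huge and hypothetical LM.
    """
    return logits_expert + beta * (logits_expert - logits_amateur)

def apd_sketch(probs_by_size, sizes):
    """Illustrative take on Asymptotic Probability Decoding.

    probs_by_size[i] is the next-token distribution of the LM with sizes[i]
    parameters. Rather than extrapolating logits along a line, APD fits each
    token's probability as a curve in model size and reads off its asymptote
    at infinite size. Fitting p(size) ~ p_inf + c / size is an assumption
    made for this sketch; the paper's actual curve family differs.
    """
    x = 1.0 / np.asarray(sizes, dtype=np.float64)   # 1/size -> 0 as size -> inf
    probs = np.stack(probs_by_size)                 # (num_models, vocab_size)
    slope, intercept = np.polyfit(x, probs, deg=1)  # per-token linear fit in 1/size
    p_inf = np.clip(intercept, 1e-12, None)         # asymptote at 1/size = 0
    return p_inf / p_inf.sum()                      # renormalize into a distribution
```

The intuition for the difference: probabilities are bounded and saturate, so a token that both LMs already rate highly is not pushed down the way CD's unbounded logit extrapolation can push it.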
Related papers
- $\mathbb{USCD}$: Improving Code Generation of LLMs by Uncertainty-Aware Selective Contrastive Decoding [64.00025564372095]
Large language models (LLMs) have shown remarkable capabilities in code generation.
The effects of hallucinations (e.g., output noise) make it challenging for LLMs to generate high-quality code in one pass.
We propose a simple and effective $\textbf{u}$ncertainty-aware $\textbf{s}$elective $\textbf{c}$ontrastive $\textbf{d}$ecoding ($\mathbb{USCD}$) mechanism.
arXiv Detail & Related papers (2024-09-09T02:07:41Z) - Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.
We propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$.
arXiv Detail & Related papers (2023-12-08T16:06:29Z) - Contrastive Decoding: Open-ended Text Generation as Optimization [153.35961722855686]
We propose contrastive decoding (CD), a reliable decoding approach.
It is inspired by the fact that the failures of larger LMs are even more prevalent in smaller LMs.
CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone.
arXiv Detail & Related papers (2022-10-27T00:58:21Z) - On Best-Arm Identification with a Fixed Budget in Non-Parametric
- On Best-Arm Identification with a Fixed Budget in Non-Parametric Multi-Armed Bandits [0.0]
We consider general, possibly non-parametric, models $\mathcal{D}$ for distributions over the arms.
We propose upper bounds on the average log-probability of misidentifying the optimal arm based on information-theoretic quantities.
arXiv Detail & Related papers (2022-09-30T10:55:40Z) - Density-aware Chamfer Distance as a Comprehensive Metric for Point Cloud
Completion [90.26652899910019]
Chamfer Distance (CD) and Earth Mover's Distance (EMD) are two broadly adopted metrics for measuring the similarity between two point sets.
We propose a new similarity measure named Density-aware Chamfer Distance (DCD).
We show that DCD pays attention to both the overall structure and local details and provides a more reliable evaluation even when CD and EMD contradict each other.
arXiv Detail & Related papers (2021-11-24T18:56:27Z) - Federated Deep AUC Maximization for Heterogeneous Data with a Constant
- Federated Deep AUC Maximization for Heterogeneous Data with a Constant Communication Complexity [77.78624443410216]
We propose improved FDAM algorithms for learning from heterogeneous data.
A key result is that the communication complexity of the proposed algorithm is a constant, independent of both the number of machines and the accuracy level.
Experiments demonstrate the effectiveness of our FDAM algorithm on benchmark datasets and on medical chest X-ray images from different organizations.
arXiv Detail & Related papers (2021-02-09T04:05:19Z) - CD-split and HPD-split: efficient conformal regions in high dimensions [3.1690891866882236]
We provide new insights on CD-split by exploring its theoretical properties.
We show that CD-split converges to the highest predictive density set and satisfies local variation and conditional validity.
We introduce HPD-split, a variation of CD-split that requires less tuning, and show that it shares the same theoretical guarantees as CD-split.
arXiv Detail & Related papers (2020-07-24T21:42:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.