Top-$n\sigma$: Not All Logits Are You Need
- URL: http://arxiv.org/abs/2411.07641v1
- Date: Tue, 12 Nov 2024 08:46:43 GMT
- Title: Top-$n\sigma$: Not All Logits Are You Need
- Authors: Chenxia Tang, Jianchun Liu, Hongli Xu, Liusheng Huang
- Abstract summary: We introduce top-$n\sigma$, a novel sampling method that operates directly on pre-softmax logits.
We show that top-$n\sigma$ maintains a stable sampling space regardless of temperature scaling.
We also provide a theoretical analysis of top-$n\sigma$ to better understand its behavior.
- Abstract: Large language models (LLMs) typically employ greedy decoding or low-temperature sampling for reasoning tasks, reflecting a perceived trade-off between diversity and accuracy. We challenge this convention by introducing top-$n\sigma$, a novel sampling method that operates directly on pre-softmax logits by leveraging a statistical threshold. Our key insight is that logits naturally separate into a Gaussian-distributed noisy region and a distinct informative region, enabling efficient token filtering without complex probability manipulations. Unlike existing methods (e.g., top-$p$, min-$p$) that inadvertently include more noise tokens at higher temperatures, top-$n\sigma$ maintains a stable sampling space regardless of temperature scaling. We also provide a theoretical analysis of top-$n\sigma$ to better understand its behavior. Extensive experimental results across four reasoning-focused datasets demonstrate that our method not only outperforms existing sampling approaches but also surpasses greedy decoding, while maintaining consistent performance even at high temperatures.
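As a concrete reading of the abstract, here is a minimal numpy sketch of the filtering rule, assuming (as the method's name suggests) that a token survives when its raw logit lies within $n$ standard deviations of the maximum logit; the function name and defaults are illustrative, not the paper's reference implementation.

```python
import numpy as np

def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=None):
    """Sample a token id after an n-sigma logit filter.

    Illustrative reading of the abstract: tokens whose pre-softmax
    logit lies within n standard deviations of the maximum logit form
    the informative region; everything else is treated as Gaussian
    noise and masked out before temperature scaling and softmax.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    threshold = logits.max() - n * logits.std()   # statistical cutoff
    masked = np.where(logits >= threshold, logits / temperature, -np.inf)
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(logits.size, p=probs)
```

Note that the mask is computed on the raw logits before temperature is applied, so the set of candidate tokens is identical at every temperature; this is the stability property the abstract claims.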
Related papers
- Quasi-Bayesian sequential deconvolution
We develop a principled sequential approach to estimate $f$ in a streaming or online domain.
Local and uniform Gaussian central limit theorems for $f_n$ are established, leading to credible intervals and bands for $f$.
An empirical validation of our methods is presented on synthetic and real data.
arXiv Detail & Related papers (2024-08-26T16:40:04Z)
- Mini-batch Submodular Maximization
We present the first mini-batch algorithm for maximizing a monotone decomposable submodular function, $F=\sum_{i=1}^{N} f^{i}$, under a set of constraints.
We consider two sampling approaches: uniform and weighted.
Surprisingly, our experimental results show that uniform sampling is superior to weighted sampling.
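To make the setup concrete, here is a toy sketch of mini-batch greedy maximization of a decomposable $F=\sum_{i=1}^{N} f^{i}$ under a cardinality constraint, estimating marginal gains on a uniformly sampled batch of components; the batch size, the cardinality constraint, and the greedy rule are illustrative assumptions, not the paper's exact algorithm.

```python
import random

def minibatch_greedy(components, universe, k, batch_size, seed=0):
    """Greedy maximization of F(S) = sum_i f_i(S) under |S| <= k,
    estimating each marginal gain on a uniform mini-batch of the
    component functions f_i instead of all N of them."""
    rng = random.Random(seed)
    S = set()
    for _ in range(k):
        batch = rng.sample(components, min(batch_size, len(components)))
        def batch_gain(e):
            return sum(f(S | {e}) - f(S) for f in batch)
        candidates = universe - S
        if not candidates:
            break
        best = max(candidates, key=batch_gain)
        if batch_gain(best) <= 0:
            break
        S.add(best)
    return S

# Example components: coverage functions f_i(S) = |S & A_i|, which are
# monotone submodular:
#   components = [lambda S, A=frozenset(A): len(S & A) for A in subsets]
```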
arXiv Detail & Related papers (2024-01-23T04:16:58Z)
- KL-Divergence Guided Temperature Sampling
As temperature increases, the prediction becomes diverse but also vulnerable to hallucinations.
One common approach to mitigate hallucinations is to provide source/grounding documents.
We propose to relax the constraint of a fixed temperature over decoding steps, together with a mechanism that guides the dynamic temperature of each step according to its relevance to the source.
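One plausible reading of such a mechanism, sketched in numpy: measure how strongly the source document is steering the current step via the KL divergence between the source-conditioned and unconditioned next-token distributions, and map that divergence to a per-step temperature. The exponential mapping and its direction are illustrative choices, not the paper's exact schedule.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = np.asarray(logits, dtype=np.float64) / temperature
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def kl_guided_temperature(cond_logits, uncond_logits, base_temp=1.0, alpha=0.5):
    """Pick a per-step temperature from the KL divergence between the
    source-conditioned and unconditioned next-token distributions.
    The exp(-alpha * KL) mapping is an illustrative choice."""
    p = softmax(cond_logits)
    q = softmax(uncond_logits)
    kl = float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))
    # large KL: the source is steering this step strongly; sample more sharply
    return base_temp * np.exp(-alpha * kl)
```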
arXiv Detail & Related papers (2023-06-02T06:11:26Z)
- Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization
We propose a new robust contrastive loss inspired by distributionally robust optimization (DRO).
We show that our algorithm automatically learns a suitable $\tau$ for each sample.
Our method outperforms prior strong baselines on unimodal and bimodal datasets.
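For intuition, a minimal sketch of an InfoNCE-style loss in which each anchor $i$ uses its own temperature $\tau_i$; the DRO-derived procedure that actually learns each $\tau_i$ is the paper's contribution and is not reproduced here.

```python
import numpy as np

def individualized_temp_contrastive_loss(sim, tau):
    """InfoNCE-style loss where anchor i uses its own temperature
    tau[i]; sim[i, j] is the similarity of anchor i to candidate j,
    with the positive pair on the diagonal sim[i, i].

    Only the loss shape is sketched; learning each tau[i] is omitted.
    """
    sim = np.asarray(sim, dtype=np.float64)
    tau = np.asarray(tau, dtype=np.float64)[:, None]
    z = sim / tau                                 # per-row temperature
    z -= z.max(axis=1, keepdims=True)             # numerical stability
    log_softmax = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_softmax)))  # positives on the diagonal
```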
arXiv Detail & Related papers (2023-05-19T19:25:56Z)
- Convergence for score-based generative modeling with polynomial complexity
We prove the first convergence guarantees for the core mechanic behind score-based generative modeling.
Compared to previous works, we do not incur error that grows exponentially in time or that suffers from a curse of dimensionality.
We show that a predictor-corrector scheme gives better convergence than using either component alone.
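The predictor-corrector pattern referenced here, sketched for a variance-exploding diffusion with a given score function; the discretization and the corrector step size are illustrative.

```python
import numpy as np

def pc_sampler(score, x, sigmas, corrector_steps=1, snr=0.1, rng=None):
    """Predictor-corrector sampling with a known score function
    score(x, sigma) ~ grad_x log p_sigma(x). One Euler-Maruyama
    reverse-diffusion step (predictor) per noise level, followed by
    Langevin MCMC steps (corrector). sigmas is a decreasing array of
    noise levels; schedule values are illustrative."""
    rng = rng or np.random.default_rng()
    for i in range(len(sigmas) - 1):
        s, s_next = sigmas[i], sigmas[i + 1]
        # predictor: one reverse-diffusion Euler step
        g2 = s**2 - s_next**2                      # discretized diffusion term
        x = x + g2 * score(x, s) + np.sqrt(g2) * rng.standard_normal(x.shape)
        # corrector: Langevin dynamics at the new noise level
        for _ in range(corrector_steps):
            grad = score(x, s_next)
            eps = 2 * (snr * s_next) ** 2          # illustrative step size
            x = x + eps * grad + np.sqrt(2 * eps) * rng.standard_normal(x.shape)
    return x
```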
arXiv Detail & Related papers (2022-06-13T14:57:35Z)
- Sharper Rates and Flexible Framework for Nonconvex SGD with Client and Data Sampling
We revisit the problem of finding an approximately stationary point of the average of $n$ smooth and possibly nonconvex functions.
We generalize the PAGE algorithm so that it can provably work with virtually any sampling mechanism.
We provide the most general and most accurate analysis of the optimal bounds in the smooth nonconvex regime.
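The garbled macro in this summary is a colored method name lost in extraction; assuming it denotes a PAGE-style variance-reduced gradient estimator (a natural reading in this line of work), here is a sketch of such an estimator with a pluggable sampling mechanism.

```python
import numpy as np

def page_sgd(grad_i, n, x0, lr=0.1, p=0.1, batch=8, steps=100, rng=None):
    """PAGE-style loop: with probability p, recompute the gradient
    estimate from a fresh batch; otherwise reuse the previous estimate
    corrected by gradient differences on a small batch. grad_i(i, x)
    returns the gradient of component f_i at x. The batch index draw
    below is uniform, but any sampling mechanism can be plugged in."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=np.float64).copy()
    idx = rng.integers(0, n, size=batch)
    g = np.mean([grad_i(i, x) for i in idx], axis=0)
    for _ in range(steps):
        x_new = x - lr * g
        idx = rng.integers(0, n, size=batch)       # sampling mechanism plugs in here
        if rng.random() < p:
            g = np.mean([grad_i(i, x_new) for i in idx], axis=0)
        else:
            g = g + np.mean([grad_i(i, x_new) - grad_i(i, x) for i in idx], axis=0)
        x = x_new
    return x
```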
arXiv Detail & Related papers (2022-06-05T21:32:33Z)
- Approximate Function Evaluation via Multi-Armed Bandits
We study the problem of estimating the value of a known smooth function $f$ at an unknown point $\boldsymbol{\mu} \in \mathbb{R}^n$, where each component $\mu_i$ can be sampled via a noisy oracle.
We design an instance-adaptive algorithm that learns to sample according to the importance of each coordinate, and with probability at least $1-\delta$ returns an $\epsilon$-accurate estimate of $f(\boldsymbol{\mu})$.
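A sketch of the instance-adaptive idea: estimate each coordinate crudely, then spend the remaining oracle budget on the coordinates to which $f$ is most sensitive. The sensitivity proxy $|\partial f/\partial \mu_i|$ and the two-phase budget split are illustrative, not the paper's algorithm.

```python
import numpy as np

def adaptive_estimate(f, grad_f, sample_coord, n, budget, warmup=10, rng=None):
    """Estimate f(mu) from noisy coordinate samples. sample_coord(i)
    returns one noisy draw of mu_i; grad_f returns the gradient of f.
    After a warmup phase, the remaining budget (budget must exceed
    warmup * n) is split in proportion to |df/dmu_i| at the crude
    estimate, an illustrative importance proxy."""
    rng = rng or np.random.default_rng()
    counts = np.full(n, warmup, dtype=np.float64)
    sums = np.array([sum(sample_coord(i) for _ in range(warmup)) for i in range(n)],
                    dtype=np.float64)
    weights = np.abs(grad_f(sums / counts)) + 1e-9
    alloc = ((budget - warmup * n) * weights / weights.sum()).astype(int)
    for i in range(n):
        for _ in range(alloc[i]):
            sums[i] += sample_coord(i)
        counts[i] += alloc[i]
    return f(sums / counts)
```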
arXiv Detail & Related papers (2022-03-18T18:50:52Z)
- Towards Sample-Optimal Compressive Phase Retrieval with Sparse and Generative Priors
We show that $O(k \log L)$ samples suffice to guarantee that the signal is close to any vector that minimizes an amplitude-based empirical loss function.
We adapt this result to sparse phase retrieval, and show that $O(s \log n)$ samples are sufficient for a similar guarantee when the underlying signal is $s$-sparse and $n$-dimensional.
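The amplitude-based empirical loss the summary refers to is short enough to state in code; the sample-complexity results concern minimizing it over the range of a generative prior $G$ (or over $s$-sparse vectors), which is not shown here.

```python
import numpy as np

def amplitude_loss(x, A, y):
    """Amplitude-based empirical loss sum_i (|<a_i, x>| - y_i)^2 for
    phaseless measurements y_i = |<a_i, x*>|, with measurement vectors
    a_i as the rows of A. The guarantees concern minimizers of this
    loss constrained to the prior (range of G, or s-sparse vectors)."""
    return float(np.sum((np.abs(A @ x) - y) ** 2))
```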
arXiv Detail & Related papers (2021-06-29T12:49:54Z)
- Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling
We show that random shuffling amplifies differential privacy guarantees of locally randomized data.
Our result is based on a new approach that is simpler than previous work and extends to approximate differential privacy with nearly the same guarantees.
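A toy sketch of the shuffle model this result lives in: each client randomizes its own bit with $\epsilon$-local randomized response, and the analyzer only sees the shuffled reports. The amplification statement is a property of the analysis, so nothing below computes the amplified guarantee.

```python
import math
import random

def randomized_response(bit, eps, rng):
    """eps-locally-DP randomized response on a single bit."""
    p_truth = math.exp(eps) / (math.exp(eps) + 1)
    return bit if rng.random() < p_truth else 1 - bit

def shuffle_reports(bits, eps, seed=0):
    """Local randomization followed by a shuffle; the shuffler hides
    which report came from which client, which is what amplifies the
    local differential privacy guarantee."""
    rng = random.Random(seed)
    reports = [randomized_response(b, eps, rng) for b in bits]
    rng.shuffle(reports)
    return reports
```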
arXiv Detail & Related papers (2020-12-23T17:07:26Z)
- Least Squares Regression with Markovian Data: Fundamental Limits and Algorithms
We study the problem of least squares linear regression where the data-points are dependent and are sampled from a Markov chain.
We establish sharp information-theoretic minimax lower bounds for this problem in terms of $\tau_{\mathsf{mix}}$.
We propose an algorithm based on experience replay, a popular reinforcement learning technique, that achieves a significantly better error rate.
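A sketch of the experience-replay idea in this setting: buffer the Markovian stream and take SGD steps on uniformly re-drawn past pairs, which decorrelates consecutive updates; the buffer size, learning rate, and replay count are illustrative.

```python
import numpy as np

def replay_sgd(stream, dim, buffer_size=1000, lr=0.01, replay_per_step=4, rng=None):
    """SGD for least squares on a Markovian stream of (x, y) pairs,
    with x a length-dim numpy array. A replay buffer of recent pairs
    is re-sampled uniformly, breaking the temporal correlation of
    consecutive Markov-chain samples."""
    rng = rng or np.random.default_rng()
    w = np.zeros(dim)
    buffer = []
    for x, y in stream:
        buffer.append((x, y))
        if len(buffer) > buffer_size:
            buffer.pop(0)
        for _ in range(replay_per_step):
            xi, yi = buffer[rng.integers(len(buffer))]
            w -= lr * (xi @ w - yi) * xi          # squared-loss gradient step
    return w
```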
arXiv Detail & Related papers (2020-06-16T04:26:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.