Stochastic activations
- URL: http://arxiv.org/abs/2509.22358v1
- Date: Fri, 26 Sep 2025 13:53:56 GMT
- Title: Stochastic activations
- Authors: Maria Lomeli, Matthijs Douze, Gergely Szilvasy, Loic Cabannes, Jade Copet, Sainbayar Sukhbaatar, Jason Weston, Gabriel Synnaeve, Pierre-Emmanuel Mazaré, Hervé Jégou,
- Abstract summary: This strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model.<n>We leverage this strategy in two ways: (1) We use activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors.<n>This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling.
- Score: 53.40901433014535
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce stochastic activations. This novel strategy randomly selects between several non-linear functions in the feed-forward layer of a large language model. In particular, we choose between SILU or RELU depending on a Bernoulli draw. This strategy circumvents the optimization problem associated with RELU, namely, the constant shape for negative inputs that prevents the gradient flow. We leverage this strategy in two ways: (1) We use stochastic activations during pre-training and fine-tune the model with RELU, which is used at inference time to provide sparse latent vectors. This reduces the inference FLOPs and translates into a significant speedup in the CPU. Interestingly, this leads to much better results than training from scratch with the RELU activation function. (2) We evaluate stochastic activations for generation. This strategy performs reasonably well: it is only slightly inferior to the best deterministic non-linearity, namely SILU combined with temperature scaling. This offers an alternative to existing strategies by providing a controlled way to increase the diversity of the generated text.
Related papers
- Flow-GRPO: Training Flow Matching Models via Online RL [75.70017261794422]
We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models.<n>Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Equation (ODE) into an equivalent Differential Equation (SDE) that matches the original model's marginal distribution at all timesteps; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number.
arXiv Detail & Related papers (2025-05-08T17:58:45Z) - Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL [20.177871969184004]
Chain-of-thought (CoT) reasoning can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps.<n>Prior approaches such as iterative reward-ranked fine-tuning fail to account for variability in difficulty and convergence behavior.<n>We propose GVMRAFT, a prompt-specific Dynamic Sample Allocation Strategy to minimize gradient variance under a computational budget constraint.
arXiv Detail & Related papers (2025-05-05T06:26:00Z) - Gompertz Linear Units: Leveraging Asymmetry for Enhanced Learning Dynamics [39.0860823332923]
GoLU is a novel self-gated activation function defined as $mathrmGoLU(x) = x, mathrmGompertz(x)$, where $mathrmGompertz(x) = e-e-x$.<n>GoLU is superior to state-of-the-art activation functions, establishing GoLU as a robust alternative to existing activation functions.
arXiv Detail & Related papers (2025-02-05T22:32:22Z) - Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training [3.195234044113248]
We propose textscNeuroAL, a emphtop-up algorithm for network pruning.<n>It modifies the block-wise and row-wise sparsity exploiting information from both the dense model and its sparse version.<n>It consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off.
arXiv Detail & Related papers (2024-11-11T15:30:16Z) - Convex Relaxations of ReLU Neural Networks Approximate Global Optima in Polynomial Time [45.72323731094864]
In this paper, we study the optimality gap between two-layer ReLULU networks regularized with weight decay and their convex relaxations.
Our study sheds new light on understanding why local methods work well.
arXiv Detail & Related papers (2024-02-06T01:29:35Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - A Framework for Provably Stable and Consistent Training of Deep
Feedforward Networks [4.21061712600981]
We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios.
This algorithm combines the standard descent gradient and the gradient clipping method.
We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
arXiv Detail & Related papers (2023-05-20T07:18:06Z) - Smooth over-parameterized solvers for non-smooth structured optimization [3.756550107432323]
Non-smoothness encodes structural constraints on the solutions, such as sparsity, group sparsity, low-rank edges and sharp edges.
We operate a non-weighted but smooth overparametrization of the underlying nonsmooth optimization problems.
Our main contribution is to apply the Variable Projection (VarPro) which defines a new formulation by explicitly minimizing over part of the variables.
arXiv Detail & Related papers (2022-05-03T09:23:07Z) - Improved Convergence Rate of Stochastic Gradient Langevin Dynamics with
Variance Reduction and its Application to Optimization [50.83356836818667]
gradient Langevin Dynamics is one of the most fundamental algorithms to solve non-eps optimization problems.
In this paper, we show two variants of this kind, namely the Variance Reduced Langevin Dynamics and the Recursive Gradient Langevin Dynamics.
arXiv Detail & Related papers (2022-03-30T11:39:00Z) - Sparse Bayesian Learning via Stepwise Regression [1.2691047660244335]
We propose a coordinate ascent algorithm for SBL termed Relevance Matching Pursuit (RMP)
As its noise variance parameter goes to zero, RMP exhibits a surprising connection to Stepwise Regression.
We derive novel guarantees for Stepwise Regression algorithms, which also shed light on RMP.
arXiv Detail & Related papers (2021-06-11T00:20:27Z) - Solving weakly supervised regression problem using low-rank manifold
regularization [77.34726150561087]
We solve a weakly supervised regression problem.
Under "weakly" we understand that for some training points the labels are known, for some unknown, and for others uncertain due to the presence of random noise or other reasons such as lack of resources.
In the numerical section, we applied the suggested method to artificial and real datasets using Monte-Carlo modeling.
arXiv Detail & Related papers (2021-04-13T23:21:01Z) - Gaussian MRF Covariance Modeling for Efficient Black-Box Adversarial
Attacks [86.88061841975482]
We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth order oracle.
We use this setting to find fast one-step adversarial attacks, akin to a black-box version of the Fast Gradient Sign Method(FGSM)
We show that the method uses fewer queries and achieves higher attack success rates than the current state of the art.
arXiv Detail & Related papers (2020-10-08T18:36:51Z) - Preventing Posterior Collapse with Levenshtein Variational Autoencoder [61.30283661804425]
We propose to replace the evidence lower bound (ELBO) with a new objective which is simple to optimize and prevents posterior collapse.
We show that Levenstein VAE produces more informative latent representations than alternative approaches to preventing posterior collapse.
arXiv Detail & Related papers (2020-04-30T13:27:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.