CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation
- URL: http://arxiv.org/abs/2410.18311v1
- Date: Wed, 23 Oct 2024 22:45:23 GMT
- Title: CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation
- Authors: Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen
- Abstract summary: Large language models (LLMs) with billions of parameters have sparked a new wave of exciting AI applications.
Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference.
In this paper, we introduce CoreInfer, an adaptive sparse activation inference method based on sentence-level prediction.
- Score: 14.823949309351129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) with billions of parameters have sparked a new wave of exciting AI applications. However, their high computational costs and memory demands during inference pose significant challenges. Adaptive sparse activation inference, which activates only a small number of neurons for each token, offers a novel way to accelerate model inference without degrading performance, showing great potential for resource-constrained hardware devices. Nevertheless, existing methods predict the activated neurons for each individual token with an additional MLP predictor, which involves frequent changes in activation maps and resource calls, limiting the acceleration benefits of sparse activation. In this paper, we introduce CoreInfer, an MLP-free adaptive sparse activation inference method based on sentence-level prediction. Specifically, we propose the concept of sentence-wise core neurons, the subset of neurons most critical for a given sentence, and empirically demonstrate its effectiveness. To determine the core neurons, we explore the correlation between core neurons and the sentence's semantics. Remarkably, we discovered that core neurons exhibit both stability and similarity with respect to the sentence's semantics, an insight overlooked by previous studies. Building on this finding, we further design two semantics-based methods for predicting core neurons to fit different input scenarios. In CoreInfer, the core neurons are determined during the pre-filling stage and kept fixed during the decoding stage, enabling zero-cost sparse inference. We evaluated the model generalization and task generalization of CoreInfer across various models and tasks. Notably, on an NVIDIA TITAN XP GPU, CoreInfer achieved 10.33x and 2.72x speedups over the Hugging Face implementation and PowerInfer, respectively.
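To make the pipeline concrete, here is a minimal PyTorch sketch of sentence-level sparse activation in a single FFN block, in the spirit of the abstract: a core-neuron set is selected once from the prompt's activations during pre-filling and then held fixed for every decoding step, so no per-token predictor is ever invoked. The class name, the scoring statistic (mean absolute activation), and the 20% core ratio are illustrative assumptions; the paper's two semantics-based prediction methods are not reproduced here.

```python
# Minimal sketch of sentence-wise core-neuron selection for one FFN block.
# The scoring statistic and core_ratio are assumptions, not the paper's recipe.
import torch

class CoreFFN(torch.nn.Module):
    def __init__(self, d_model: int, d_ff: int, core_ratio: float = 0.2):
        super().__init__()
        self.up = torch.nn.Linear(d_model, d_ff)
        self.down = torch.nn.Linear(d_ff, d_model)
        self.core_ratio = core_ratio
        self.core_idx = None  # set once during pre-filling, then frozen

    def prefill(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model). Run the dense FFN once over the prompt and
        # record which intermediate neurons fire most strongly for this sentence.
        h = torch.relu(self.up(x))
        score = h.abs().mean(dim=0)                   # per-neuron statistic
        k = max(1, int(self.core_ratio * h.shape[-1]))
        self.core_idx = torch.topk(score, k).indices  # sentence-wise core neurons
        return self.down(h)

    def decode_step(self, x: torch.Tensor) -> torch.Tensor:
        # x: (1, d_model). Only the frozen core neurons are computed, so no
        # per-token predictor runs during decoding (zero-cost sparsity).
        w_up = self.up.weight[self.core_idx]          # (k, d_model)
        b_up = self.up.bias[self.core_idx]
        h = torch.relu(x @ w_up.T + b_up)
        w_down = self.down.weight[:, self.core_idx]   # (d_model, k)
        return h @ w_down.T + self.down.bias
```

Because the index set is frozen after pre-filling, each decoding step touches only k rows of the up-projection and k columns of the down-projection, which is where the sketched speedup would come from.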
Related papers
- Parallel Training in Spiking Neural Networks [47.43408320628711]
The bio-inspired integrate-fire-reset mechanism of spiking neurons constitutes the foundation for efficient processing in Spiking Neural Networks (SNNs).
Recent progress in large models demands that spiking neurons support highly parallel computation to scale efficiently on modern GPUs.
This work proposes a novel functional perspective that provides general guidance for designing parallel spiking neurons.
arXiv Detail & Related papers (2026-02-01T10:10:47Z)
- Language Model Circuits Are Sparse in the Neuron Basis [50.460651620833055]
We show that MLP neurons are as sparse a feature basis as sparse autoencoders (SAEs).
This work advances automated interpretability of language models without additional training costs.
arXiv Detail & Related papers (2026-01-30T05:41:19Z)
- A Scalable, Causal, and Energy Efficient Framework for Neural Decoding with Spiking Neural Networks [30.855279392147082]
Spikachu is a scalable, causal, and energy-efficient neural decoding framework based on SNNs.
We evaluate our approach on 113 recording sessions from 6 non-human primates.
Our method outperforms causal baselines when trained on single sessions, using between 2.26 and 418.81 times less energy.
arXiv Detail & Related papers (2025-10-23T15:55:45Z)
- Fractional Spike Differential Equations Neural Network with Efficient Adjoint Parameters Training [63.3991315762955]
Spiking Neural Networks (SNNs) draw inspiration from biological neurons to create realistic models for brain-like computation.
Most existing SNNs assume a single time constant for neuronal membrane voltage dynamics, modeled by first-order ordinary differential equations (ODEs) with Markovian characteristics.
We propose the Fractional SPIKE Differential Equation neural network (fspikeDE), which captures long-term dependencies in membrane voltage and spike trains through fractional-order dynamics.
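To make the contrast concrete, here is the standard first-order leaky integrate-and-fire membrane equation alongside a fractional-order generalization of the kind the abstract describes; this rendering is an illustration based on common formulations, not an equation taken from the paper.

```latex
% First-order (Markovian) membrane dynamics with a single time constant tau:
\tau \frac{du}{dt} = -\bigl(u(t) - u_{\text{rest}}\bigr) + R\, I(t)
% Fractional-order generalization with derivative order 0 < \alpha < 1;
% the power-law memory kernel of D^{\alpha} introduces long-term dependence:
\tau\, D^{\alpha} u(t) = -\bigl(u(t) - u_{\text{rest}}\bigr) + R\, I(t)
```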
arXiv Detail & Related papers (2025-07-22T18:20:56Z)
- CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models [12.277869260176068]
Token sparsity mitigates inefficiencies in token usage, while neuron sparsity reduces high-dimensional computations.
Recently, these two sparsity paradigms have evolved largely in parallel, fostering the prevailing assumption that they function independently.
We propose CoreMatching, a co-adaptive sparse inference framework, which leverages the synergy between token and neuron sparsity to enhance inference efficiency.
arXiv Detail & Related papers (2025-05-25T17:16:34Z)
- Confidence Regulation Neurons in Language Models [91.90337752432075]
This study investigates the mechanisms by which large language models represent and regulate uncertainty in next-token predictions.
Entropy neurons are characterized by an unusually high weight norm and influence the final layer normalization (LayerNorm) scale to effectively scale down the logits.
Token frequency neurons, which we describe here for the first time, boost or suppress each token's logit proportionally to its log frequency, thereby shifting the output distribution towards or away from the unigram distribution.
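The frequency-neuron effect is simple enough to state in a few lines. The sketch below is a hypothetical illustration of the behavior the abstract describes, not the paper's implementation; the function name and the scalar activation are assumptions.

```python
# Hypothetical sketch of a token-frequency neuron: each token's logit is
# shifted in proportion to its log unigram frequency.
import torch

def apply_frequency_neuron(logits: torch.Tensor,
                           unigram_counts: torch.Tensor,
                           activation: float) -> torch.Tensor:
    """Shift logits toward (activation > 0) or away from (activation < 0)
    the unigram distribution. unigram_counts: raw token counts (float)."""
    log_freq = torch.log(unigram_counts / unigram_counts.sum())
    return logits + activation * log_freq
```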
arXiv Detail & Related papers (2024-06-24T01:31:03Z)
- Fast gradient-free activation maximization for neurons in spiking neural networks [5.805438104063613]
We present a framework with an efficient design for the activation-maximization optimization loop.
We track changes in the optimal stimuli for artificial neurons during training.
This formation of refined optimal stimuli is associated with an increase in classification accuracy.
arXiv Detail & Related papers (2023-12-28T18:30:13Z)
- Cross-Model Comparative Loss for Enhancing Neuronal Utility in Language Understanding [82.46024259137823]
We propose a cross-model comparative loss for a broad range of tasks.
We demonstrate the universal effectiveness of comparative loss through extensive experiments on 14 datasets from 3 distinct NLU tasks.
arXiv Detail & Related papers (2023-01-10T03:04:27Z)
- STNDT: Modeling Neural Population Activity with a Spatiotemporal Transformer [19.329190789275565]
We introduce SpatioTemporal Neural Data Transformer (STNDT), an NDT-based architecture that explicitly models responses of individual neurons.
We show that our model achieves state-of-the-art performance at the ensemble level in estimating neural activities across four neural datasets.
arXiv Detail & Related papers (2022-06-09T18:54:23Z)
- Training Feedback Spiking Neural Networks by Implicit Differentiation on the Equilibrium State [66.2457134675891]
Spiking neural networks (SNNs) are brain-inspired models that enable energy-efficient implementation on neuromorphic hardware.
Most existing methods imitate the backpropagation framework and feedforward architectures for artificial neural networks.
We propose a novel training method that does not rely on the exact reverse of the forward computation.
arXiv Detail & Related papers (2021-09-29T07:46:54Z)
- Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z)
- And/or trade-off in artificial neurons: impact on adversarial robustness [91.3755431537592]
The presence of a sufficient number of OR-like neurons in a network can lead to classification brittleness and increased vulnerability to adversarial attacks.
We define AND-like neurons and propose measures to increase their proportion in the network.
Experimental results on the MNIST dataset suggest that our approach holds promise as a direction for further exploration.
arXiv Detail & Related papers (2021-02-15T08:19:05Z)
- Towards Efficient Processing and Learning with Spikes: New Approaches for Multi-Spike Learning [59.249322621035056]
We propose two new multi-spike learning rules that outperform other baselines on various tasks.
In the feature detection task, we re-examine the ability of unsupervised STDP and present its limitations.
Our proposed learning rules can reliably solve the task over a wide range of conditions without specific constraints being applied.
arXiv Detail & Related papers (2020-05-02T06:41:20Z)
- Unifying and generalizing models of neural dynamics during decision-making [27.46508483610472]
We propose a unifying framework for modeling neural activity during decision-making tasks.
The framework includes the canonical drift-diffusion model (a minimal simulation of which is sketched after this entry) and enables extensions such as multi-dimensional accumulators, variable and collapsing boundaries, and discrete jumps.
arXiv Detail & Related papers (2020-01-13T23:57:28Z)
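For readers unfamiliar with the drift-diffusion model referenced above, the following is a minimal, self-contained simulation sketch; the drift, noise, and boundary values are arbitrary illustrations, and nothing here is drawn from the paper beyond the model's standard definition.

```python
# Euler-Maruyama simulation of the canonical drift-diffusion model: evidence
# x accumulates as dx = v*dt + sigma*sqrt(dt)*noise until it hits an absorbing
# boundary at +/- B. Parameter values are illustrative only.
import numpy as np

def simulate_ddm(v=0.5, sigma=1.0, B=1.0, dt=1e-3, max_t=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x, t = 0.0, 0.0
    while t < max_t:
        x += v * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if abs(x) >= B:
            return (1 if x > 0 else 0), t  # (choice, reaction time)
    return None, max_t  # no boundary crossing within max_t

choice, rt = simulate_ddm()
print(f"choice={choice}, reaction time={rt:.3f}s")
```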
This list is automatically generated from the titles and abstracts of the papers in this site.