Entropy-Guided Attention for Private LLMs
- URL: http://arxiv.org/abs/2501.03489v2
- Date: Wed, 08 Jan 2025 22:22:43 GMT
- Title: Entropy-Guided Attention for Private LLMs
- Authors: Nandan Kumar Jha, Brandon Reagen
- Abstract summary: We introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models. By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload.
- Score: 3.7802450241986945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pervasiveness of proprietary language models has raised critical privacy concerns, necessitating advancements in private inference (PI), where computations are performed directly on encrypted data without revealing users' sensitive information. While PI offers a promising solution, its practical deployment is hindered by substantial communication and latency overheads, primarily stemming from nonlinear operations. To address this, we introduce an information-theoretic framework to characterize the role of nonlinearities in decoder-only language models, laying a principled foundation for optimizing transformer architectures tailored to the demands of PI. By leveraging Shannon's entropy as a quantitative measure, we uncover the previously unexplored dual significance of nonlinearities: beyond ensuring training stability, they are crucial for maintaining attention head diversity. Specifically, we find that their removal triggers two critical failure modes: entropy collapse in deeper layers that destabilizes training, and entropic overload in earlier layers that leads to under-utilization of Multi-Head Attention's (MHA) representational capacity. We propose an entropy-guided attention mechanism paired with a novel entropy regularization technique to mitigate entropic overload. Additionally, we explore PI-friendly alternatives to layer normalization for preventing entropy collapse and stabilizing the training of LLMs with reduced nonlinearities. Our study bridges the gap between information theory and architectural design, establishing entropy dynamics as a principled guide for developing efficient PI architectures. The code and implementation are available at https://github.com/Nandan91/entropy-guided-attention-llm
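To make the idea concrete, the sketch below shows one way the per-head Shannon entropy of attention could be measured and penalized in PyTorch when it drifts toward the uniform-attention ceiling ("entropic overload"). The threshold form, function names, and coefficients are illustrative assumptions, not the paper's exact regularizer; see the linked repository for the authors' implementation.

```python
import math
import torch

def attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Shannon entropy of attention distributions.
    attn_probs: (batch, heads, queries, keys), softmax-normalized over keys.
    Returns per-head entropy averaged over batch and query positions: shape (heads,)."""
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # (batch, heads, queries)
    return ent.mean(dim=(0, 2))

def entropic_overload_penalty(attn_probs: torch.Tensor, seq_len: int,
                              target_frac: float = 0.6, weight: float = 1e-2) -> torch.Tensor:
    """Penalize heads whose entropy exceeds a fraction of the maximum log(seq_len),
    i.e. heads drifting toward near-uniform attention (assumed penalty form)."""
    max_ent = math.log(seq_len)
    head_ent = attention_entropy(attn_probs)
    excess = torch.relu(head_ent - target_frac * max_ent)
    return weight * (excess ** 2).mean()

# Usage inside a training step (attn_probs collected from the attention module):
# loss = lm_loss + entropic_overload_penalty(attn_probs, seq_len=attn_probs.size(-1))
```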
Related papers
- UP-dROM : Uncertainty-Aware and Parametrised dynamic Reduced-Order Model, application to unsteady flows [27.50487430169627]
Reduced order models (ROMs) play a critical role in fluid mechanics by providing low-cost predictions.
For ROMs to be widely applicable, they must not only generalise well across different regimes, but also provide a measure of confidence in their predictions.
We present a nonlinear reduction strategy specifically designed for transient flows.
arXiv Detail & Related papers (2025-03-29T22:17:36Z) - Paving the way for scientific foundation models: enhancing generalization and robustness in PDEs with constraint-aware pre-training [49.8035317670223]
A scientific foundation model (SciFM) is emerging as a promising tool for learning transferable representations across diverse domains.
We propose incorporating PDE residuals into pre-training either as the sole learning signal or in combination with data loss to compensate for limited or infeasible training data.
Our results show that pre-training with PDE constraints significantly enhances generalization, outperforming models trained solely on solution data.
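As a rough illustration of how a PDE residual can serve as a pre-training signal on its own or alongside a data loss, here is a hedged sketch using a 1D heat equation; the equation, weighting, and function names are illustrative assumptions, not the paper's actual setup.

```python
import torch

def heat_residual(model, x, t, alpha=0.1):
    """Residual r = u_t - alpha * u_xx for a network u(x, t); x, t, r have shape (N,)."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = model(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t - alpha * u_xx

def pretraining_loss(model, x, t, u_data=None, lam=1.0):
    """PDE residual as the sole learning signal (u_data is None) or combined with a data loss."""
    loss = (heat_residual(model, x, t) ** 2).mean()
    if u_data is not None:
        u_pred = model(torch.stack([x, t], dim=-1)).squeeze(-1)
        loss = loss + lam * ((u_pred - u_data) ** 2).mean()
    return loss
```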
arXiv Detail & Related papers (2025-03-24T19:12:39Z) - Encrypted Large Model Inference: The Equivariant Encryption Paradigm [18.547945807599543]
We introduce Equivariant Encryption (EE), a novel paradigm designed to enable secure, "blind" inference on encrypted data with near zero performance overhead.
Unlike fully homomorphic approaches that encrypt the entire computational graph, EE selectively obfuscates critical internal representations within neural network layers.
EE maintains high fidelity and throughput, effectively bridging the gap between robust data confidentiality and the stringent efficiency requirements of modern, large scale model inference.
arXiv Detail & Related papers (2025-02-03T03:05:20Z) - Physics-Informed Latent Neural Operator for Real-time Predictions of Complex Physical Systems [0.0]
Deep operator network (DeepONet) has shown great promise as a surrogate model for systems governed by partial differential equations (PDEs).
This work introduces PI-Latent-NO, a physics-informed latent operator learning framework that addresses key limitations of DeepONet-based surrogates.
arXiv Detail & Related papers (2025-01-14T20:38:30Z) - ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models [3.7802450241986945]
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization.
This work explores desirable activation functions in normalization-free decoder-only LLMs.
ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement.
arXiv Detail & Related papers (2024-10-12T20:26:01Z) - Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization [77.62516752323207]
We introduce an orthogonal fine-tuning method for efficiently fine-tuning pretrained weights and enabling enhanced robustness and generalization.
A self-regularization strategy is further employed to preserve the zero-shot generalization of VLMs; the resulting method is dubbed OrthSR.
For the first time, we revisit CLIP and CoOp with our method to effectively improve the models in the few-shot image classification scenario.
arXiv Detail & Related papers (2024-07-11T10:35:53Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
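For intuition, the following is a minimal sketch of a token-level policy-gradient objective with an entropy bonus, in the spirit of the entropy-regularized method summarized above; the exact ETPO formulation differs, and the names and coefficients here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def entropy_regularized_pg_loss(logits, actions, advantages, beta=0.01):
    """logits: (batch, seq, vocab); actions: (batch, seq) sampled token ids;
    advantages: (batch, seq) per-token advantage estimates."""
    log_probs = F.log_softmax(logits, dim=-1)
    action_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (batch, seq)
    pg_loss = -(advantages.detach() * action_logp).mean()                  # token-level policy gradient
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()      # mean per-token entropy
    return pg_loss - beta * token_entropy                                  # entropy bonus encourages exploration
```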
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Assessing Neural Network Representations During Training Using
Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z) - Provably Efficient Algorithm for Nonstationary Low-Rank MDPs [48.92657638730582]
We make the first effort to investigate nonstationary RL under episodic low-rank MDPs, where both transition kernels and rewards may vary over time.
We propose a parameter-dependent policy optimization algorithm called PORTAL, and further improve PORTAL to its parameter-free version of Ada-PORTAL.
For both algorithms, we provide upper bounds on the average dynamic suboptimality gap, which show that as long as the nonstationarity is not significantly large, PORTAL and Ada-PORTAL are sample-efficient and can achieve an arbitrarily small average dynamic suboptimality gap with polynomial sample complexity.
arXiv Detail & Related papers (2023-08-10T09:52:44Z) - Residual-based attention and connection to information bottleneck theory
in PINNs [0.393259574660092]
Physics-informed neural networks (PINNs) have seen a surge of interest in recent years.
We propose an efficient, gradient-less weighting scheme for PINNs, that accelerates the convergence of dynamic or static systems.
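To make the gradient-less weighting idea concrete, here is a minimal sketch of one plausible residual-based scheme: per-collocation-point weights follow an exponential moving average of normalized residual magnitudes. The update rule, names, and constants are assumptions for illustration; the paper's scheme may differ in detail.

```python
import torch

def update_residual_weights(weights, residuals, gamma=0.999, eta=0.01):
    """Gradient-free update: weights track normalized residual magnitudes via an EMA.
    weights, residuals: (num_collocation_points,)."""
    r = residuals.detach().abs()
    return gamma * weights + eta * r / (r.max() + 1e-12)

def weighted_pinn_loss(residuals, weights):
    """Collocation points with larger (historical) residuals get larger weight."""
    return ((weights * residuals) ** 2).mean()
```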
arXiv Detail & Related papers (2023-07-01T16:29:55Z) - Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline
Reinforcement Learning [114.36124979578896]
We design a dynamic mechanism using offline reinforcement learning algorithms.
Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set.
arXiv Detail & Related papers (2022-05-05T05:44:26Z) - A purely data-driven framework for prediction, optimization, and control
of networked processes: application to networked SIS epidemic model [0.8287206589886881]
We develop a data-driven framework based on operator-theoretic techniques to identify and control nonlinear dynamics over large-scale networks.
The proposed approach requires no prior knowledge of the network structure and identifies the underlying dynamics solely using a collection of two-step snapshots of the states.
arXiv Detail & Related papers (2021-08-01T03:57:10Z)