NoiseFormer -- Noise Diffused Symmetric Attention Transformer
- URL: http://arxiv.org/abs/2601.11619v1
- Date: Sat, 10 Jan 2026 14:10:48 GMT
- Title: NoiseFormer -- Noise Diffused Symmetric Attention Transformer
- Authors: Phani Kumar Nyshadham, Jyothendra Varma Polisetty V R K, Aditya Rathore
- Abstract summary: We propose a novel unified model architecture called the Noise Diffused Symmetric Attention Transformer to enhance model performance. The proposed model is validated on the GPT2 base model, and the results show performance falling between plain Symmetric Attention and the GPT2 base model.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer architecture has been a long-running success in Deep Learning (DL) and Large Language Models (LLMs) thanks to its powerful attention-based learning and parallel-friendly design. As models grow gigantic in memory footprint, the difficulty of fitting a model on a single device such as a GPU or an AI accelerator forces the use of multiple computing devices, escalating the computing cost. This increased training/inference cost has motivated model-size and parameter reduction through Sparse Attention techniques. In this paper, we analyze one such technique, Symmetric Dot-Product Attention (referred to as Symmetric Attention), and propose a novel unified model architecture, the Noise Diffused Symmetric Attention Transformer, to enhance model performance. While maintaining the memory gains of Symmetric Attention, and with only minute overhead in model parameters and computation, the proposed model improves accuracy and inference-time sampling. The proposed model is validated on the GPT2 base model; across a variety of GLUE benchmark tasks its accuracy falls between plain Symmetric Attention and the GPT2 base model, with a significant model-size reduction relative to the base model.
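The abstract does not spell out the mechanism, but here is a minimal single-head sketch of the underlying ideas, assuming Symmetric Attention means a shared query/key projection (W_Q = W_K) and reading "noise diffusion" as Gaussian noise injected into that shared projection during training; both readings are assumptions of this sketch, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisySymmetricAttention(nn.Module):
    """Single-head sketch: one projection serves as both query and key
    (Symmetric Dot-Product Attention), removing the separate key matrix.
    Injecting Gaussian noise into the shared projection is a hypothetical
    reading of 'noise diffusion', not the paper's exact method."""

    def __init__(self, d_model: int, noise_std: float = 0.01):
        super().__init__()
        self.w_qk = nn.Linear(d_model, d_model, bias=False)  # shared Q/K
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.noise_std = noise_std

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        qk = self.w_qk(x)                                    # (B, S, D)
        if self.training and self.noise_std > 0:
            qk = qk + self.noise_std * torch.randn_like(qk)  # diffuse noise
        scores = qk @ qk.transpose(-2, -1) / x.size(-1) ** 0.5
        return F.softmax(scores, dim=-1) @ self.w_v(x)

x = torch.randn(2, 16, 64)
print(NoisySymmetricAttention(64)(x).shape)  # torch.Size([2, 16, 64])
```

Tying the query and key projections is where the parameter savings over the GPT2 baseline would come from; the noise term adds only a single scalar hyperparameter.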
Related papers
- Large Language Models Inference Engines based on Spiking Neural Networks [5.529385616266398]
We explore spiking neural networks (SNNs) to design transformer models. A challenge is that training large-scale SNNs is inefficient and time-consuming. We propose NeurTransformer, a methodology for designing transformer-based SNNs for inference.
arXiv Detail & Related papers (2025-09-30T18:11:13Z)
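The summary above does not detail NeurTransformer's design; as an illustrative sketch of the building block such transformer-to-SNN conversions rely on, here is a leaky integrate-and-fire (LIF) layer that turns continuous activations into spike trains (all constants are assumptions):

```python
import torch

def lif_spikes(current, steps=8, tau=0.9, v_th=1.0):
    """Leaky integrate-and-fire sketch: the membrane potential integrates
    the input current each step, emits a binary spike when it crosses v_th,
    then soft-resets. SNN inference engines replace continuous activations
    with spike trains like this one."""
    v = torch.zeros_like(current)
    spikes = []
    for _ in range(steps):
        v = tau * v + current          # leaky integration
        s = (v >= v_th).float()        # fire where threshold is crossed
        v = v - s * v_th               # soft reset of fired neurons
        spikes.append(s)
    return torch.stack(spikes)         # (steps, *current.shape)

out = lif_spikes(torch.rand(4, 16))
print(out.shape, out.mean().item())    # spike rate tracks input magnitude
```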
- Noise Hypernetworks: Amortizing Test-Time Compute in Diffusion Models [57.49136894315871]
The new paradigm of test-time scaling has yielded remarkable breakthroughs in reasoning models and generative vision models. We propose a solution for integrating test-time scaling knowledge into a model during post-training. We replace reward-guided test-time noise optimization in diffusion models with a Noise Hypernetwork that modulates the initial input noise.
arXiv Detail & Related papers (2025-08-13T17:33:37Z)
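A minimal sketch of the amortization idea as the summary describes it: a small learned network maps the initial noise (plus a condition) to a better starting noise, replacing per-sample test-time optimization. The shapes and the scale/shift conditioning pathway are assumptions of this sketch, not the paper's architecture:

```python
import torch
import torch.nn as nn

class NoiseHypernetwork(nn.Module):
    """Maps (initial noise, condition) to a modulated starting noise, so the
    cost of reward-guided noise search is paid once at post-training instead
    of at every sampling run. Illustrative shapes and layers."""

    def __init__(self, noise_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 256), nn.SiLU(),
            nn.Linear(256, 2 * noise_dim),   # per-dim scale and shift
        )

    def forward(self, eps: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.net(torch.cat([eps, cond], -1)).chunk(2, -1)
        return eps * (1 + scale) + shift     # modulated initial noise

eps = torch.randn(2, 64)                     # would seed the diffusion sampler
cond = torch.randn(2, 32)
print(NoiseHypernetwork(64, 32)(eps, cond).shape)  # torch.Size([2, 64])
```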
- Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer [17.463052541838504]
Fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate interference when merging model parameters across tasks. We introduce Neural Parameter Search (NPS-Pruning), a novel method for slimming down fine-tuned models.
arXiv Detail & Related papers (2025-05-24T14:27:20Z)
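A sketch of the "prune the fine-tuned delta, then merge" idea; plain magnitude selection stands in for the paper's neural parameter search, and keep_ratio is an illustrative choice:

```python
import torch

def prune_task_vector(pretrained, finetuned, keep_ratio=0.1):
    """Keep only the largest-magnitude entries of the fine-tuning delta
    (finetuned - pretrained) and add them back to the base weights. The
    paper searches for such masks; magnitude selection is a stand-in."""
    delta = finetuned - pretrained
    k = max(1, int(keep_ratio * delta.numel()))
    thresh = delta.abs().flatten().topk(k).values.min()
    mask = (delta.abs() >= thresh).float()
    return pretrained + mask * delta        # slimmed task-specific model

base = torch.randn(256, 256)
tuned = base + 0.05 * torch.randn(256, 256)
merged = prune_task_vector(base, tuned, keep_ratio=0.1)
print((merged != base).float().mean().item())  # ~0.1 of entries kept
```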
- Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models [1.8434042562191815]
This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities, reducing model size and computational overhead while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that efficiency can be improved by eliminating redundancies with minimal impact on overall performance.
arXiv Detail & Related papers (2025-01-28T17:22:01Z)
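A block-level sketch of sensitivity-guided removal in the spirit of the summary above; the paper works on SSM components at several granularities, while this toy only ablates whole blocks of a generic network:

```python
import torch
import torch.nn as nn

def ablation_sensitivity(model: nn.Sequential, x, y, loss_fn):
    """Temporarily replace each block with an identity, measure the loss
    increase, and flag low-impact blocks as removal candidates. A small
    score means the block is a cheap candidate to shed."""
    base = loss_fn(model(x), y).item()
    scores = {}
    for i, block in enumerate(model):
        model[i] = nn.Identity()            # shed the block
        scores[i] = loss_fn(model(x), y).item() - base
        model[i] = block                    # restore it
    return scores

blocks = [nn.Sequential(nn.Linear(32, 32), nn.Tanh()) for _ in range(4)]
model = nn.Sequential(*blocks)
x, y = torch.randn(8, 32), torch.randn(8, 32)
print(ablation_sensitivity(model, x, y, nn.MSELoss()))
```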
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses the demands of real-time visual inference by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture and edge deployment together.
arXiv Detail & Related papers (2024-10-29T19:02:54Z)
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z)
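A sketch of the zero-shot construction step as the summary describes it: compress each fine-tuned delta into a low-rank expert via truncated SVD, with no extra data or training. The rank and the (omitted) router are assumptions of this sketch:

```python
import torch

def low_rank_expert(pretrained, finetuned, rank=8):
    """Compress the fine-tuning delta into a rank-`rank` expert via
    truncated SVD; the expert approximates x @ delta.T while storing only
    two thin matrices. The per-token router is omitted."""
    u, s, vh = torch.linalg.svd(finetuned - pretrained, full_matrices=False)
    a = u[:, :rank] * s[:rank]     # (out_dim, rank)
    b = vh[:rank]                  # (rank, in_dim)
    return a, b

base = torch.randn(64, 64)
a, b = low_rank_expert(base, base + 0.1 * torch.randn(64, 64))
x = torch.randn(5, 64)
y = x @ base.T + x @ b.T @ a.T     # base output plus one routed expert
print(y.shape)                     # torch.Size([5, 64])
```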
- Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training [32.154166415680066]
Methods like distillation, compression, or quantization leverage a highly performant large model to induce a smaller performant one.
This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
arXiv Detail & Related papers (2024-02-07T17:07:41Z)
- Scaling Pre-trained Language Models to Deeper via Parameter-efficient Architecture [68.13678918660872]
We design a more capable parameter-sharing architecture based on the matrix product operator (MPO).
MPO decomposition can reorganize and factorize the information of a parameter matrix into two parts.
Our architecture shares the central tensor across all layers for reducing the model size.
arXiv Detail & Related papers (2023-03-27T02:34:09Z)
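A sketch of the sharing idea from the summary above: factorize each layer's weight so one large central factor is reused across all layers while small per-layer factors stay local. A simple three-factor product stands in for the full MPO chain, and the sizes are illustrative:

```python
import torch
import torch.nn as nn

class SharedCoreLinear(nn.Module):
    """Weight = left @ core @ right, where `core` (the dominant parameter
    block) is one shared Parameter reused by every layer, so depth grows
    without growing the central factor."""

    def __init__(self, d: int, core: nn.Parameter, r: int = 16):
        super().__init__()
        self.left = nn.Parameter(torch.randn(d, r) / r ** 0.5)   # per layer
        self.core = core                                         # shared
        self.right = nn.Parameter(torch.randn(r, d) / r ** 0.5)  # per layer

    def forward(self, x):
        return x @ (self.left @ self.core @ self.right)

shared = nn.Parameter(torch.randn(16, 16) / 4)
layers = nn.ModuleList(SharedCoreLinear(64, shared) for _ in range(12))
x = torch.randn(2, 64)
for layer in layers:
    x = torch.tanh(layer(x))
print(x.shape)  # torch.Size([2, 64]); 12 layers, one shared core
```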
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of the data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
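A sketch of the spatial half of the idea: attention scores are kept only between skeleton joints marked adjacent in a mask (the segmented linear temporal attention is not shown). The toy skeleton and projections are illustrative:

```python
import torch
import torch.nn.functional as F

def sparse_joint_attention(x, adj, w_q, w_k, w_v):
    """Attention over skeleton joints where non-adjacent pairs are set to
    -inf before the softmax, so each joint attends only to its bone
    neighbours (and itself)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / q.size(-1) ** 0.5
    scores = scores.masked_fill(~adj, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

j, d = 5, 8                                           # toy 5-joint skeleton
x = torch.randn(j, d)
adj = torch.eye(j, dtype=torch.bool)                  # self-connections
adj[0, 1] = adj[1, 0] = adj[1, 2] = adj[2, 1] = True  # a short bone chain
w = [torch.randn(d, d) for _ in range(3)]
print(sparse_joint_attention(x, adj, *w).shape)       # torch.Size([5, 8])
```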
- Edge Federated Learning Via Unit-Modulus Over-The-Air Computation (Extended Version) [64.76619508293966]
This paper proposes a unit-modulus over-the-air computation (UM-AirComp) framework to facilitate efficient edge federated learning.
It simultaneously uploads local model parameters and updates global model parameters via analog beamforming.
We demonstrate the implementation of UM-AirComp in a vehicle-to-everything autonomous driving simulation platform.
arXiv Detail & Related papers (2021-01-28T15:10:22Z)
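A toy simulation of the over-the-air aggregation idea: simultaneous analog transmissions superpose into a sum, and the unit-modulus constraint means a device can only rotate the phase of its channel, not invert its magnitude, so the receiver recovers a channel-weighted average. All numbers here are illustrative, not the paper's system model:

```python
import numpy as np

rng = np.random.default_rng(0)

def um_aircomp_round(local_models, h, noise_std=0.01):
    """K devices transmit at once over complex channel gains h[k]; each
    pre-compensates with a unit-modulus phase (|b_k| = 1), so h[k]*b[k]
    collapses to |h[k]| and the air computes a weighted sum for free."""
    b = np.conj(h) / np.abs(h)                  # unit-modulus beamformers
    rx = sum(h[i] * b[i] * m for i, m in enumerate(local_models))
    rx = rx + noise_std * rng.standard_normal(rx.shape)
    return rx.real / np.abs(h).sum()            # channel-weighted average

h = rng.standard_normal(5) + 1j * rng.standard_normal(5)  # device channels
models = [rng.standard_normal(8) for _ in range(5)]
w = np.abs(h) / np.abs(h).sum()
print(np.allclose(um_aircomp_round(models, h, 0.0),
                  sum(wi * m for wi, m in zip(w, models))))  # True
```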
- Understanding the effect of hyperparameter optimization on machine learning models for structure design problems [8.504300709184177]
Machine learning algorithms (MLAs) have been implemented as surrogate models in computer-aided engineering design.
There is a lack of systematic studies on the effect of hyperparameters on the accuracy and robustness of the surrogate model.
Four frequently used MLAs, namely Gaussian Process Regression (GPR), Support Vector Machine (SVM), Random Forest Regression (RFR) and Artificial Neural Network (ANN) are tested.
The results show that hyperparameter optimization (HOpt) generally improves the performance of the MLA models.
arXiv Detail & Related papers (2020-07-04T14:57:34Z)
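A minimal sketch of this kind of study for one of the four MLAs: grid-searching GPR hyperparameters on a toy response surface, the kind of setting whose effect the paper examines (the grid values and data are illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import GridSearchCV

# Toy surrogate-modeling task: fit GPR to a nonlinear response and let
# cross-validated grid search pick the kernel length scale and noise term.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (80, 2))
y = np.sin(6 * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.standard_normal(80)

search = GridSearchCV(
    GaussianProcessRegressor(),
    {"kernel": [RBF(length_scale=s) for s in (0.1, 0.3, 1.0, 3.0)],
     "alpha": [1e-6, 1e-3, 1e-1]},   # observation-noise regularization
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))  # best R^2
```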