Related papers: TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs

URL: http://arxiv.org/abs/2501.15674v2
Date: Thu, 15 May 2025 12:42:44 GMT
Title: TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs
Authors: Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
Abstract summary: We propose a novel framework that performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets. We show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
Score: 3.808154352665581
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel, intuitive framework that, at its core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
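
The idea of a shared subspace across attention heads can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the per-head weights are stacked into a 4D tensor of shape (d_model, d_head, 4, n_heads), with the third mode indexing four weight matrices per head, and it uses a truncated higher-order SVD as one standard way to obtain Tucker factors. The tensor layout, the ranks, and the helper names (unfold, mode_dot, shared_tucker, reconstruct) are all assumptions made for illustration.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_dot(tensor, matrix, mode):
    """Mode-n product: multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)


def shared_tucker(weights, ranks):
    """Truncated HOSVD over the first three modes only.

    The factor matrices are shared by all heads; the last (head) mode is
    left uncompressed, so every head keeps its own small core slice.
    """
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(weights, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = weights
    for mode, u in enumerate(factors):
        core = mode_dot(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Expand the core back to the full tensor with the shared factors."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_dot(out, u, mode)
    return out


# Hypothetical sizes, chosen only to keep the demo small.
d_model, d_head, n_mats, n_heads = 64, 16, 4, 8
ranks = (8, 4, 2)

# Synthetic weights that are exactly low-rank in the first three modes.
rng = np.random.default_rng(0)
true_core = rng.standard_normal(ranks + (n_heads,))
true_factors = [
    rng.standard_normal((dim, r))
    for dim, r in zip((d_model, d_head, n_mats), ranks)
]
weights = reconstruct(true_core, true_factors)

core, factors = shared_tucker(weights, ranks)
approx = reconstruct(core, factors)

original_params = weights.size
compressed_params = core.size + sum(u.size for u in factors)
print("max abs error:", np.abs(weights - approx).max())
print("compression ratio:", original_params / compressed_params)
```

On real model weights the truncation would discard part of the signal rather than reconstruct it exactly, and the ranks would have to be chosen per layer; the synthetic tensor here is low-rank by construction, so the sketch only demonstrates the mechanics and the parameter accounting.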

Related papers

Explicit Multi-head Attention for Inter-head Interaction in Large Language Models [70.96854312026319]
Multi-head Explicit Attention (MEA) is a simple yet effective attention variant that explicitly models cross-head interaction. MEA shows strong robustness in pretraining, which allows the use of larger learning rates that lead to faster convergence. This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss.
arXiv Detail & Related papers (2026-01-27T13:45:03Z)
Multiscale Aggregated Hierarchical Attention (MAHA): A Game Theoretic and Optimization Driven Approach to Efficient Contextual Modeling in Large Language Models [0.0]
Multiscale Aggregated Hierarchical Attention (MAHA) is a novel architectural framework that reformulates the attention mechanism through hierarchical decomposition and mathematically rigorous aggregation. MAHA dynamically partitions the input sequence into hierarchical scales via learnable downsampling operators. Experimental evaluations demonstrate that MAHA achieves superior scalability; empirical FLOPs analysis confirms an 81% reduction in computational cost at a sequence length of 4096 compared to standard attention.
arXiv Detail & Related papers (2025-12-16T21:27:21Z)
DLRREC: Denoising Latent Representations via Multi-Modal Knowledge Fusion in Deep Recommender Systems [0.6875312133832079]
Large Language Models (LLMs) generate rich, yet high-dimensional and noisy, multi-modal features. Treating these features as static inputs decouples them from the core recommendation task. We introduce a novel framework built on a key insight: deeply fusing multi-modal and collaborative knowledge for representation denoising.
arXiv Detail & Related papers (2025-11-29T18:57:42Z)
MoME: Mixture of Matryoshka Experts for Audio-Visual Speech Recognition [39.90876258237132]
Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities. MoME is a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based large language models for speech recognition. MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters.
arXiv Detail & Related papers (2025-10-05T10:34:34Z)
When MLLMs Meet Compression Distortion: A Coding Paradigm Tailored to MLLMs [38.29061845878822]
We propose an image Codec TAilored to MLLMs (CoTAM) designed to adaptively protect multi-level features and suit different demands of downstream tasks. Our method achieves up to 35.99% saving while maintaining the same performance on the MLLM tasks.
arXiv Detail & Related papers (2025-09-29T04:07:52Z)
PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z)
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models. Our approach employs activation sparsity to extract experts. Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [56.00251589760559]
Large language models (LLMs) can act as gradient priors in a zero-shot setting. We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding. Experiments indicate that LM-GC surpasses existing state-of-the-art lossless compression methods.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research. Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration. Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models [9.244526043014098]
Large language models (LLMs) show excellent performance in difficult tasks, but they often require massive memory and computational resources. In this study, we make an important observation that the multi-head self-attention (MHA) sub-layer of the Transformer exhibits a noticeable low-rank structure. We propose a mixed compression model, which organically combines Low-Rank matrix And structured Pruning (LoRAP).
arXiv Detail & Related papers (2024-04-15T11:53:22Z)
CRaSh: Clustering, Removing, and Sharing Enhance Fine-tuning without Full Large Language Model [22.870512676002463]
This paper focuses on Offsite-Tuning (OFT), a representative technique that transfers transformer blocks between centralized LLMs and downstream emulators. Inspired by these observations, we propose CRaSh, involving Clustering, Removing, and Sharing, a training-free strategy to derive improved emulators from LLMs. Our findings demonstrate a linear connectivity among these optima falling over the same basin, thereby highlighting the effectiveness of CRaSh and OFT.
arXiv Detail & Related papers (2023-10-24T03:08:58Z)
Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module, the SAM-guidEd refinEment Module (SEEM). This light-weight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features. We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)
Adaptive Dynamic Filtering Network for Image Denoising [8.61083713580388]
In image denoising networks, feature scaling is widely used to enlarge the receptive field size and reduce computational costs. We propose to employ dynamic convolution to improve the learning of high-frequency and multi-scale features. We build an efficient denoising network with the proposed DCB and MDCB, named ADFNet.
arXiv Detail & Related papers (2022-11-22T06:54:27Z)
Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics. With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture. Experiments on the three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z)
Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain. AFNO is based on a principled foundation of operator learning. It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.