Less is More! A slim architecture for optimal language translation
- URL: http://arxiv.org/abs/2305.10991v1
- Date: Thu, 18 May 2023 14:09:52 GMT
- Title: Less is More! A slim architecture for optimal language translation
- Authors: Luca Herranz-Celotti, Ermal Rrapaj
- Abstract summary: The softmax attention mechanism has emerged as a noteworthy development in the field of Artificial Intelligence research.
We propose KgV, a sigmoid gating mechanism that boosts performance without increasing architecture size.
Our proposed method yields significant improvements in performance and much lower memory cost.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The softmax attention mechanism has emerged as a noteworthy development in
the field of Artificial Intelligence research, building on the successes of
Transformer-based architectures. However, their ever increasing sizes
necessitate ever increasing computational memory, that limits their usage. We
propose KgV, a sigmoid gating mechanism that, in conjunction with softmax
attention, significantly boosts performance without increasing architecture
size. To amend the size requirements, we leverage Tensor Chains to identify and
prune the excess parameters. We find that such excess resides primarily within
the embedding layer, and not in the output linear layer. To further improve
embedding and significantly reduce parameters, we introduce H-SoftPOS, a
hierarchical embedding layer which simultaneously enhances performance.
Remarkably, on the WMT14 English-German validation set, our approach yields a
threefold reduction in perplexity, surpassing the current state-of-the-art,
while reducing parameter counts also by a factor of 3. When we further reduce
the number of parameters up to sevenfold, we can still achieve a 21\% decrease
in perplexity with respect to the baseline Transformer. To understand
generalization capabilities, we conduct experiments on the 7 language pairs of
the WMT17 dataset. Our method outperforms existing techniques in terms of test
loss while simultaneously halving the number of parameters. Moreover, we
observe a 70 times reduction in variance with respect to the prior
state-of-the-art. In conclusion, our proposed method yields significant
improvements in performance and much lower memory cost. We call the resulting
architecture Anthe.
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned gradient conjugate-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z) - ReduceFormer: Attention with Tensor Reduction by Summation [4.985969607297595]
We introduce ReduceFormer, a family of models optimized for efficiency with the spirit of attention.
ReduceFormer leverages only simple operations such as reduction and element-wise multiplication, leading to greatly simplified architecture and improved inference performance.
The proposed model family is suitable for edge devices where compute resource and memory bandwidth are limited, as well as for cloud computing where high throughput is sought after.
arXiv Detail & Related papers (2024-06-11T17:28:09Z) - Compressible Dynamics in Deep Overparameterized Low-Rank Learning & Adaptation [12.07880147193174]
We show that by leveraging the inherent low-dimensional structures of data and compressible dynamics within the model parameters, we can reap the benefits of over parameterization without the computational burdens.
We demonstrate the effectiveness of this approach for deep low-rank matrix completion as well as fine-tuning language models.
arXiv Detail & Related papers (2024-06-06T14:29:49Z) - Is Mamba Compatible with Trajectory Optimization in Offline Reinforcement Learning? [32.33214392196923]
Transformer-based trajectory optimization methods have demonstrated exceptional performance in offline Reinforcement Learning (offline RL)
This work aims to conduct comprehensive experiments to explore the potential of Decision Mamba in offline RL (dubbed DeMa)
Our specially designed DeMa is compatible with trajectory optimization and surpasses previous state-of-the-art methods.
arXiv Detail & Related papers (2024-05-20T15:05:47Z) - Data-freeWeight Compress and Denoise for Large Language Models [101.53420111286952]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices.
We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z) - CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks [1.5199992713356987]
This paper introduces CompactifAI, an innovative compression approach using quantum-inspired networks.
Our method is versatile and can be implemented with - or on top of - other compression techniques.
As a benchmark, we demonstrate that a combination of CompactifAI with quantization allows to reduce a 93% memory size of LlaMA 7B.
arXiv Detail & Related papers (2024-01-25T11:45:21Z) - Multi-Grid Tensorized Fourier Neural Operator for High-Resolution PDEs [93.82811501035569]
We introduce a new data efficient and highly parallelizable operator learning approach with reduced memory requirement and better generalization.
MG-TFNO scales to large resolutions by leveraging local and global structures of full-scale, real-world phenomena.
We demonstrate superior performance on the turbulent Navier-Stokes equations where we achieve less than half the error with over 150x compression.
arXiv Detail & Related papers (2023-09-29T20:18:52Z) - HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer
Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z) - Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained
Language Models [68.9288651177564]
We present a novel MoE architecture based on matrix product operators (MPO) from quantum many-body physics.
With the decomposed MPO structure, we can reduce the parameters of the original MoE architecture.
Experiments on the three well-known downstream natural language datasets based on GPT2 show improved performance and efficiency in increasing model capacity.
arXiv Detail & Related papers (2022-03-02T13:44:49Z) - Efficient Micro-Structured Weight Unification and Pruning for Neural
Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially for resource limited devices.
Previous unstructured or structured weight pruning methods can hardly truly accelerate inference.
We propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.