An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
- URL: http://arxiv.org/abs/2409.02596v1
- Date: Wed, 4 Sep 2024 10:27:07 GMT
- Title: An Analysis of Linear Complexity Attention Substitutes with BEST-RQ
- Authors: Ryan Whetten, Titouan Parcollet, Adel Moumen, Marco Dinarelli, Yannick Estève
- Abstract summary: Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing, but it is computationally and memory expensive, in part due to the quadratic complexity of multi-head self-attention (MHSA).
- Score: 23.934743358907895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due to the quadratic complexity of multi-head self-attention (MHSA). Alternatives for MHSA have been proposed and used in the speech domain, but have yet to be investigated properly in an SSL setting. In this work, we study the effects of replacing MHSA with recent state-of-the-art alternatives that have linear complexity, namely, HyperMixing, Fastformer, SummaryMixing, and Mamba. We evaluate these methods by looking at the speed, the amount of VRAM consumed, and the performance on the SSL MP3S benchmark. Results show that these linear alternatives maintain competitive performance compared to MHSA while, on average, decreasing VRAM consumption by around 20% to 60% and increasing speed from 7% to 65% for input sequences ranging from 20 to 80 seconds.
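To make the linear-complexity idea concrete, below is a minimal PyTorch sketch of a token-mixing layer in the spirit of SummaryMixing, one of the MHSA substitutes studied here: every frame is combined with a single mean-pooled summary of the whole utterance, so cost grows linearly with sequence length. The hidden sizes, activations, and combiner are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of a linear-complexity drop-in for multi-head self-attention,
# loosely following the SummaryMixing idea mentioned in the abstract.
# Layer sizes and the exact combiner are assumptions for illustration only.
import torch
import torch.nn as nn


class SummaryMixingSketch(nn.Module):
    """Mixes time steps through one mean-pooled summary vector (O(T) cost)."""

    def __init__(self, d_model: int, d_hidden: int = 512):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.summary = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU())
        self.combine = nn.Linear(2 * d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        local = self.local(x)                        # per-frame transform
        summary = self.summary(x).mean(dim=1)        # one summary per utterance
        summary = summary.unsqueeze(1).expand_as(local)
        return self.combine(torch.cat([local, summary], dim=-1))


if __name__ == "__main__":
    x = torch.randn(2, 2000, 256)  # roughly 20 s of features at a 10 ms hop (assumed)
    layer = SummaryMixingSketch(d_model=256)
    print(layer(x).shape)  # torch.Size([2, 2000, 256])
```

Because no T x T attention map is ever materialized, memory and compute scale with T rather than T^2, which is consistent with the VRAM and speed gains the paper reports on 20 to 80 second inputs.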
Related papers
- Dynamic Low-Rank Sparse Adaptation for Large Language Models [54.1231638555233]
Low-rank Sparse Adaptation (LoSA) is a novel method that seamlessly integrates low-rank adaptation into LLM sparsity.
LoSA dynamically sparsifies the LoRA outcomes based on the corresponding sparse weights during fine-tuning.
LoSA can efficiently boost the efficacy of sparse LLMs within a few hours, without introducing any additional inferential burden.
arXiv Detail & Related papers (2025-02-20T18:37:32Z)
- Learning Harmonized Representations for Speculative Sampling [6.053109175707596]
Speculative sampling is a promising approach to accelerate the decoding stage for Large Language Models (LLMs)
We propose a solution named HArmonized Speculative Sampling (HASS) that learns harmonized representations to address these issues.
HASS accelerates the decoding stage without adding inference overhead through harmonized objective distillation and harmonized context alignment.
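For context, the sketch below shows the generic draft-and-verify loop that speculative sampling is built on, in a simplified greedy form with toy models. It is not HASS itself, and the harmonized distillation and context alignment the paper proposes are not shown.

```python
# Generic greedy speculative decoding sketch: a small draft model proposes a few
# tokens, the large target model verifies them in a single forward pass.
# Toy last-token-only "language models" are used purely for illustration.
import torch
import torch.nn as nn

VOCAB = 100


class ToyLM(nn.Module):
    """Stand-in language model: next-token logits depend only on the last token."""

    def __init__(self, seed: int):
        super().__init__()
        torch.manual_seed(seed)
        self.table = nn.Embedding(VOCAB, VOCAB)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        # ids: (seq,) -> next-token logits after each position: (seq, VOCAB)
        return self.table(ids)


@torch.no_grad()
def speculative_step(target: ToyLM, draft: ToyLM, prefix: torch.Tensor, k: int = 4):
    # 1) Draft proposes k tokens greedily.
    ids = prefix.clone()
    for _ in range(k):
        nxt = draft(ids)[-1].argmax().view(1)
        ids = torch.cat([ids, nxt])
    # 2) Target scores prefix + proposals in one pass.
    target_pred = target(ids).argmax(dim=-1)
    # 3) Accept proposals while they match the target's own greedy choice.
    n_prefix = prefix.numel()
    accepted = prefix.clone()
    for i in range(k):
        proposal = ids[n_prefix + i]
        wanted = target_pred[n_prefix + i - 1]
        if proposal == wanted:
            accepted = torch.cat([accepted, proposal.view(1)])
        else:
            return torch.cat([accepted, wanted.view(1)])  # target's correction
    return torch.cat([accepted, target_pred[-1].view(1)])  # bonus token when all accepted


print(speculative_step(ToyLM(seed=0), ToyLM(seed=1), torch.tensor([1, 2, 3])))
```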
arXiv Detail & Related papers (2024-08-28T12:59:12Z)
- Linear-Complexity Self-Supervised Learning for Speech Processing [17.360059094663182]
Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs.
This paper studies a linear-complexity context encoder for SSL for the first time.
arXiv Detail & Related papers (2024-07-18T10:34:33Z)
- DailyMAE: Towards Pretraining Masked Autoencoders in One Day [37.206816999538496]
Masked image modeling (MIM) has drawn attention for its effectiveness in learning data representation from unlabeled data.
In this study, we propose efficient training recipes for MIM-based SSL that focus on mitigating data-loading bottlenecks.
Our library enables the training of a MAE-Base/16 model on the ImageNet 1K dataset for 800 epochs within just 18 hours.
arXiv Detail & Related papers (2024-03-31T00:59:10Z)
- Are We on the Right Way for Evaluating Large Vision-Language Models? [92.5761176224556]
Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities.
We identify two primary issues: visual content is unnecessary for many samples, and intentional data leakage exists.
We present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans.
arXiv Detail & Related papers (2024-03-29T17:59:34Z)
- Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning [69.77973092264338]
We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
arXiv Detail & Related papers (2023-09-26T23:55:57Z)
- Full Parameter Fine-tuning for Large Language Models with Limited Resources [55.794732214059806]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training.
We propose a new computation, LOw-Memory Optimization (LOMO), which fuses the gradient and the parameter update in one step to reduce memory usage.
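As a rough illustration of the fused update described above, the sketch below applies a plain SGD step to each parameter as soon as its gradient is ready and then frees that gradient, so full-model gradients are never held in memory at once. It uses the PyTorch >= 2.1 post-accumulate-grad hook and is a generic sketch, not the authors' LOMO implementation.

```python
# Generic sketch of fusing the optimizer step into the backward pass
# (LOMO-style): update each parameter as soon as its gradient arrives,
# then release that gradient. Plain SGD is an assumption for illustration.
import torch
import torch.nn as nn


def attach_fused_sgd(model: nn.Module, lr: float = 1e-3) -> None:
    """Register hooks that update parameters during backward and free grads."""

    @torch.no_grad()
    def sgd_in_backward(param: torch.Tensor) -> None:
        param.add_(param.grad, alpha=-lr)  # SGD step for this parameter only
        param.grad = None                  # free the gradient immediately

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(sgd_in_backward)


if __name__ == "__main__":
    model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
    attach_fused_sgd(model, lr=1e-2)
    loss = model(torch.randn(8, 64)).pow(2).mean()
    loss.backward()  # parameters are updated during this call; no optimizer.step()
```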
arXiv Detail & Related papers (2023-06-16T11:37:15Z)
- Speech separation with large-scale self-supervised learning [41.96634125460265]
Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments.
We extend the exploration of SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and the fine-tuning data (10K hours).
arXiv Detail & Related papers (2022-11-09T20:00:21Z)
- Exploring Efficient-tuning Methods in Self-supervised Speech Models [53.633222197712875]
Self-supervised learning can learn powerful representations for different speech tasks.
In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained.
We show that performance parity can be achieved with over 90% parameter reduction.
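A generic sketch of this recipe is shown below: the pre-trained SSL encoder is frozen and only small residual bottleneck adapters plus a task head are trained. The encoder stand-in, adapter width, and the 40-class head are assumptions for illustration, not the paper's setup.

```python
# Generic adapter-tuning sketch: freeze the SSL encoder, train only small
# bottleneck adapters and a task head. All shapes are illustrative assumptions.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Small residual bottleneck module placed after the frozen encoder."""

    def __init__(self, d_model: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))  # residual adapter


ssl_encoder = nn.TransformerEncoder(  # stand-in for a pre-trained SSL model
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True), num_layers=2
)
for p in ssl_encoder.parameters():
    p.requires_grad = False  # freeze the SSL model

adapter, head = Adapter(256), nn.Linear(256, 40)  # e.g. 40 phone classes (assumed)
optimizer = torch.optim.Adam(list(adapter.parameters()) + list(head.parameters()), lr=1e-3)

x = torch.randn(4, 200, 256)            # (batch, time, features)
logits = head(adapter(ssl_encoder(x)))  # only adapter + head receive gradients
# A task loss on `logits` followed by optimizer.step() would complete the recipe.
```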
arXiv Detail & Related papers (2022-10-10T11:08:12Z)
- Acceleration of Subspace Learning Machine via Particle Swarm Optimization and Parallel Processing [23.33955958124822]
Subspace learning machine (SLM) has been proposed to offer higher performance in general classification and regression tasks.
Performance improvement is reached at the expense of higher computational complexity.
Experimental results show that the accelerated SLM method achieves a speed-up factor of 577 in training time.
arXiv Detail & Related papers (2022-08-15T06:33:15Z)
- Sound and Visual Representation Learning with Multiple Pretraining Tasks [104.11800812671953]
Different self-supervised learning (SSL) tasks reveal different features from the data.
This work aims to combine multiple SSL tasks (Multi-SSL) so that the learned representations generalize well to all downstream tasks.
Experiments on sound representations demonstrate that Multi-SSL via incremental learning (IL) of SSL tasks outperforms single SSL task models.
arXiv Detail & Related papers (2022-01-04T09:09:38Z)
- Improving Efficiency in Large-Scale Decentralized Distributed Training [58.80224380923698]
We propose techniques to accelerate (A)D-PSGD based training by improving the spectral gap while minimizing the communication cost.
We demonstrate the effectiveness of our proposed techniques by running experiments on the 2000-hour Switchboard speech recognition task and the ImageNet computer vision task.
arXiv Detail & Related papers (2020-02-04T04:29:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.