Linear-Complexity Self-Supervised Learning for Speech Processing
- URL: http://arxiv.org/abs/2407.13377v1
- Date: Thu, 18 Jul 2024 10:34:33 GMT
- Title: Linear-Complexity Self-Supervised Learning for Speech Processing
- Authors: Shucong Zhang, Titouan Parcollet, Rogier van Dalen, Sourav Bhattacharya
- Abstract summary: Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs.
This paper studies a linear-complexity context encoder for SSL for the first time.
- Score: 17.360059094663182
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised learning (SSL) models usually require weeks of pre-training with dozens of high-end GPUs. These models typically have a multi-headed self-attention (MHSA) context encoder. However, MHSA takes quadratic time and space in the input length, contributing to the high pre-training cost. Linear-complexity alternatives to MHSA have been proposed. For instance, in supervised training, the SummaryMixing model is the first to outperform MHSA across multiple speech processing tasks. However, these cheaper alternatives have not yet been explored for SSL. This paper studies a linear-complexity context encoder for SSL for the first time. With better or equivalent performance on the downstream tasks of the MP3S benchmark, SummaryMixing reduces the pre-training time and peak VRAM of the wav2vec 2.0 model by 18% and 23%, respectively, allowing a 155M-parameter wav2vec 2.0 model to be pre-trained within one week on 4 Tesla A100 GPUs. Code is available at https://github.com/SamsungLabs/SummaryMixing.
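To make the complexity difference concrete, below is a minimal PyTorch sketch of a SummaryMixing-style block: instead of comparing every pair of frames as MHSA does, each frame is combined with a single mean-pooled summary vector, so the cost grows linearly with sequence length. The class name, layer shapes, and single-layer transforms are illustrative assumptions for exposition, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class SummaryMixingBlock(nn.Module):
    """Linear-time stand-in for MHSA: each frame is combined with one
    summary vector (a mean of per-frame projections), so cost is O(T)
    in the sequence length T instead of O(T^2)."""

    def __init__(self, d_model: int, d_summary: int = None):
        super().__init__()
        d_summary = d_summary or d_model
        self.local = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())      # per-frame ("local") transform
        self.summary = nn.Sequential(nn.Linear(d_model, d_summary), nn.GELU())  # contributions to the summary
        self.combine = nn.Sequential(nn.Linear(d_model + d_summary, d_model), nn.GELU())

    def forward(self, x: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        # x: (batch, time, d_model); mask: (batch, time), 1 for valid frames
        s = self.summary(x)                                      # (B, T, d_summary)
        if mask is not None:
            m = mask.unsqueeze(-1).to(s.dtype)
            s = (s * m).sum(dim=1) / m.sum(dim=1).clamp(min=1)   # masked mean over time
        else:
            s = s.mean(dim=1)                                    # (B, d_summary)
        s = s.unsqueeze(1).expand(-1, x.size(1), -1)             # broadcast summary to every frame
        return self.combine(torch.cat([self.local(x), s], dim=-1))
```

With this shape contract (batch, time, d_model in and out), such a block could in principle slot in wherever an MHSA layer sits in a wav2vec 2.0-style context encoder.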
Related papers
- An Analysis of Linear Complexity Attention Substitutes with BEST-RQ [23.934743358907895]
Self-Supervised Learning has proven to be effective in various domains, including speech processing.
This is in part due to the quadratic complexity of multi-head self-attention (MHSA).
arXiv Detail & Related papers (2024-09-04T10:27:07Z)
- Open Implementation and Study of BEST-RQ for Speech Processing [25.678292575349648]
BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has shown great performance on Automatic Speech Recognition (ASR).
We show that a random projection quantizer can achieve similar downstream performance as wav2vec 2.0 while decreasing training time by over a factor of two (a minimal sketch of such a quantizer follows this entry).
arXiv Detail & Related papers (2024-05-07T13:11:37Z)
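For reference, here is a minimal sketch of a BEST-RQ-style random-projection quantizer: acoustic frames are projected with a frozen random matrix and assigned to the nearest entry of a frozen random codebook, producing discrete targets for masked prediction. The codebook size, code dimension, and normalization choices are illustrative assumptions, not the paper's exact configuration.

```python
import torch

class RandomProjectionQuantizer:
    """BEST-RQ-style quantizer: a frozen random projection followed by a
    nearest-neighbour lookup in a frozen random codebook. The returned
    indices serve as prediction targets for masked-frame SSL."""

    def __init__(self, input_dim: int, codebook_size: int = 8192, code_dim: int = 16, seed: int = 0):
        g = torch.Generator().manual_seed(seed)
        self.projection = torch.randn(input_dim, code_dim, generator=g)   # frozen, never trained
        codebook = torch.randn(codebook_size, code_dim, generator=g)
        self.codebook = torch.nn.functional.normalize(codebook, dim=-1)   # unit-norm codes

    @torch.no_grad()
    def __call__(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, time, input_dim) acoustic frames (e.g. log-mel features)
        z = torch.nn.functional.normalize(features @ self.projection, dim=-1)
        # nearest codebook entry by cosine similarity -> (batch, time) integer targets
        return (z @ self.codebook.T).argmax(dim=-1)
```

Because both the projection and the codebook stay fixed, the quantizer itself adds no trainable parameters; only the context encoder that predicts these indices is learned.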
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- DailyMAE: Towards Pretraining Masked Autoencoders in One Day [37.206816999538496]
Masked image modeling (MIM) has drawn attention for its effectiveness in learning data representation from unlabeled data.
In this study, we propose efficient training recipes for MIM-based SSL that focus on mitigating data loading bottlenecks.
Our library enables the training of a MAE-Base/16 model on the ImageNet 1K dataset for 800 epochs within just 18 hours.
arXiv Detail & Related papers (2024-03-31T00:59:10Z)
- Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning [69.77973092264338]
We show that more powerful techniques can lead to more efficient pre-training, opening SSL to more research groups.
We propose WavLabLM, which extends WavLM's joint prediction and denoising to 40k hours of data across 136 languages.
We show that further efficiency can be achieved with a vanilla HuBERT Base model, which can maintain 94% of XLS-R's performance with only 3% of the data.
arXiv Detail & Related papers (2023-09-26T23:55:57Z)
- LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech [70.3307853082527]
This work introduces LeBenchmark 2.0, an open-source framework for assessing and building SSL-equipped French speech technologies.
It includes documented, large-scale and heterogeneous corpora with up to 14,000 hours of speech.
It includes ten pre-trained SSL wav2vec 2.0 models containing from 26 million to one billion learnable parameters shared with the community.
arXiv Detail & Related papers (2023-09-11T14:13:09Z)
- Low-rank Adaptation Method for Wav2vec2-based Fake Audio Detection [57.537583869961885]
Self-supervised speech models are a rapidly developing research topic in fake audio detection.
We apply low-rank adaptation (LoRA) to the wav2vec2 model, freezing the pre-trained model weights and injecting a trainable rank-decomposition matrix into each layer of the transformer architecture.
Compared with fine-tuning the wav2vec2 model's 317M trainable parameters with Adam, LoRA achieved similar performance while reducing the number of trainable parameters by a factor of 198 (a minimal sketch of a LoRA layer follows this entry).
arXiv Detail & Related papers (2023-06-09T01:43:41Z)
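As a reference for the mechanism described above, here is a minimal sketch of a LoRA-wrapped linear layer: the pre-trained weight is frozen and a trainable low-rank update B·A is added to its output. The rank, scaling, and initialization are illustrative defaults, not the paper's wav2vec2 configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a trainable low-rank update:
    y = W x + (alpha / r) * B (A x), with A of shape (r, d_in) and B of
    shape (d_out, r). Only A and B receive gradients, so the trainable
    parameter count drops from d_in * d_out to r * (d_in + d_out)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                        # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))      # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T
```

Wrapping, for example, the attention projections of each transformer block in such a layer leaves the original checkpoint untouched while training only the low-rank factors.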
- Cross-Modal Adapter for Text-Video Retrieval [91.9575196703281]
We present a novel Cross-Modal Adapter for parameter-efficient fine-tuning.
Inspired by adapter-based methods, we adjust the pre-trained model with a few additional parameterized layers.
It achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets.
arXiv Detail & Related papers (2022-11-17T16:15:30Z)
- Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio [19.865050806327147]
Self-supervised learning has proven vital in speech and audio-related applications.
This paper provides the first empirical study of SSL pre-training for different specified sequence lengths.
We find that training on short sequences can dramatically reduce resource costs while retaining a satisfactory performance for all tasks.
arXiv Detail & Related papers (2022-09-30T16:35:42Z)
- USB: A Unified Semi-supervised Learning Benchmark [125.25384569880525]
Semi-supervised learning (SSL) improves model generalization by leveraging massive unlabeled data to augment limited labeled samples.
Previous work typically trains deep neural networks from scratch, which is time-consuming and environmentally unfriendly.
We construct a Unified SSL Benchmark (USB) by selecting 15 diverse, challenging, and comprehensive tasks.
arXiv Detail & Related papers (2022-08-12T15:45:48Z)