Self-supervised Rewiring of Pre-trained Speech Encoders: Towards Faster
Fine-tuning with Less Labels in Speech Processing
- URL: http://arxiv.org/abs/2210.13030v1
- Date: Mon, 24 Oct 2022 08:27:09 GMT
- Title: Self-supervised Rewiring of Pre-trained Speech Encoders: Towards Faster
Fine-tuning with Less Labels in Speech Processing
- Authors: Hao Yang, Jinming Zhao, Gholamreza Haffari and Ehsan Shareghi
- Abstract summary: We take a sober look into pre-trained speech encoders and rewire their representation space without requiring task-specific labels.
Our experiments on 6 speech processing tasks exhibit a significant convergence speedup during task fine-tuning as well as consistent task improvement.
- Score: 66.92823764664206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained speech Transformers have facilitated great success across various
speech processing tasks. However, fine-tuning these encoders for downstream
tasks requires sufficiently large training data to converge or to achieve
state-of-the-art performance. In the text domain this has been partly attributed
to sub-optimality of the representation space in pre-trained Transformers. In
this work, we take a sober look into pre-trained speech encoders and rewire
their representation space without requiring any task-specific labels. Our
method utilises a neutrally synthesised version of audio inputs along with
frame masking to construct positive pairs for contrastive self-supervised
learning. When used to augment the wav2vec 2 encoder, we observe a consistent
improvement of isotropy in the representation space. Our experiments on 6
speech processing tasks exhibit a significant convergence speedup during task
fine-tuning as well as consistent task improvement, especially in low-resource
settings.
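The abstract describes the core recipe only at a high level: pair each utterance with a neutrally synthesised (TTS) version of itself, apply frame masking to both views, and pull their encoder representations together with a contrastive objective. The sketch below illustrates that idea under stated assumptions; the function names, masking rate, temperature, mean-pooling, and NT-Xent-style loss are illustrative choices, not the paper's exact implementation, and `encoder` stands in for any frame-level speech encoder such as wav2vec 2.

```python
# Minimal sketch of the self-supervised rewiring idea: an utterance and its
# neutrally synthesised counterpart form a positive pair; random frames are
# masked; a contrastive loss pulls the pooled representations together.
# All hyperparameters and helper names here are assumptions for illustration.
import torch
import torch.nn.functional as F


def mask_frames(frames: torch.Tensor, mask_prob: float = 0.15) -> torch.Tensor:
    """Zero out a random subset of time frames. frames: (batch, time, dim)."""
    keep = torch.rand(frames.shape[:2], device=frames.device) > mask_prob
    return frames * keep.unsqueeze(-1)


def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Contrastive loss: (original, synthesised) views of the same utterance
    are positives; other utterances in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def rewiring_step(encoder, wav_original, wav_synthesised, optimizer):
    """One self-supervised update; no task-specific labels are used.
    `encoder` is assumed to return frame features of shape (batch, time, dim)."""
    h_orig = mask_frames(encoder(wav_original))      # masked frames, view 1
    h_synt = mask_frames(encoder(wav_synthesised))   # masked frames, view 2
    loss = nt_xent(h_orig.mean(dim=1), h_synt.mean(dim=1))  # mean-pool over time
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the synthesised view shares content but not speaker or prosody with the original, an objective of this shape encourages representations that spread out more uniformly, which is consistent with the isotropy improvement the abstract reports.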
Related papers
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo
Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as a low-cost second-stage pre-training step.
arXiv Detail & Related papers (2022-05-02T17:59:02Z) - Enhanced Direct Speech-to-Speech Translation Using Self-supervised
Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z) - WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech
Processing [102.45426364965887]
We propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks.
WavLM is built on the HuBERT framework, with an emphasis on both spoken content modeling and speaker identity preservation.
We scale up the training dataset from 60k hours to 94k hours of public audio data, and optimize its training procedure for better representation extraction.
arXiv Detail & Related papers (2021-10-26T17:55:19Z) - ConvFiT: Conversational Fine-Tuning of Pretrained Language Models [42.7160113690317]
Transformer-based language models (LMs) pretrained on large text collections have been shown to store a wealth of semantic knowledge.
We propose ConvFiT, a simple and efficient two-stage procedure which turns any pretrained LM into a universal conversational encoder.
arXiv Detail & Related papers (2021-09-21T12:16:56Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data from the target speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - On the Usefulness of Self-Attention for Automatic Speech Recognition
with Transformers [40.991809705930955]
We train encoders whose lower layers use self-attention and whose upper layers use feed-forward layers on Wall Street Journal and Switchboard.
Compared to baseline Transformers, we observe no performance drop but minor gains.
We conclude that a global view is unnecessary when training the upper encoder layers.
arXiv Detail & Related papers (2020-11-08T16:01:38Z) - Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.