TERA: Self-Supervised Learning of Transformer Encoder Representation for
Speech
- URL: http://arxiv.org/abs/2007.06028v3
- Date: Wed, 4 Aug 2021 05:38:15 GMT
- Title: TERA: Self-Supervised Learning of Transformer Encoder Representation for
Speech
- Authors: Andy T. Liu, Shang-Wen Li, and Hung-yi Lee
- Abstract summary: TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
- Score: 63.03318307254081
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a self-supervised speech pre-training method called TERA, which
stands for Transformer Encoder Representations from Alteration. Recent
approaches often learn by using a single auxiliary task like contrastive
prediction, autoregressive prediction, or masked reconstruction. Unlike
previous methods, we use alteration along three orthogonal axes to pre-train
Transformer Encoders on a large amount of unlabeled speech. The model learns
through the reconstruction of acoustic frames from their altered counterpart,
where we use a stochastic policy to alter along various dimensions: time,
frequency, and magnitude. TERA can be used for speech representations
extraction or fine-tuning with downstream models. We evaluate TERA on several
downstream tasks, including phoneme classification, keyword spotting, speaker
recognition, and speech recognition. We present a large-scale comparison of
various self-supervised models. TERA achieves strong performance in the
comparison by improving upon surface features and outperforming previous
models. In our experiments, we study the effect of applying different
alteration techniques, pre-training on more data, and pre-training on various
features. We analyze different model sizes and find that smaller models are
stronger representation learners than larger models, while larger models are more
effective for downstream fine-tuning than smaller models. Furthermore, we show
the proposed method is transferable to downstream datasets not used in
pre-training.
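To make the pre-training objective above concrete, here is a minimal sketch of a TERA-style training step: a stochastic policy alters log-mel frames along the time, frequency (channel), and magnitude axes, and a Transformer encoder learns to reconstruct the clean frames from the altered input. The masking widths, probabilities, noise scale, model size, and L1 loss below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a TERA-style pre-training step (illustrative only).
# Hyper-parameters and probabilities below are assumptions, not the
# paper's exact configuration.
import torch
import torch.nn as nn


def alter(frames: torch.Tensor,
          time_mask_prob: float = 0.15,
          time_mask_width: int = 7,
          freq_mask_width: int = 8,
          noise_prob: float = 0.1) -> torch.Tensor:
    """Apply stochastic alteration along time, frequency, and magnitude."""
    x = frames.clone()                      # (batch, time, mel_bins)
    b, t, f = x.shape

    # Time alteration: zero out random contiguous blocks of frames.
    n_blocks = max(1, int(t * time_mask_prob / time_mask_width))
    for i in range(b):
        for _ in range(n_blocks):
            start = torch.randint(0, max(1, t - time_mask_width), (1,)).item()
            x[i, start:start + time_mask_width, :] = 0.0

    # Frequency alteration: zero out a random band of mel channels.
    for i in range(b):
        start = torch.randint(0, max(1, f - freq_mask_width), (1,)).item()
        x[i, :, start:start + freq_mask_width] = 0.0

    # Magnitude alteration: add Gaussian noise to a random subset of frames.
    noise_mask = (torch.rand(b, t, 1) < noise_prob).float()
    return x + noise_mask * 0.2 * torch.randn_like(x)


# Reconstruction objective: predict the clean frames from the altered input.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=80, nhead=8, batch_first=True),
    num_layers=3)
head = nn.Linear(80, 80)                    # projects back to mel features

clean = torch.randn(4, 200, 80)             # fake batch of log-mel frames
altered = alter(clean)
reconstruction = head(encoder(altered))
loss = nn.functional.l1_loss(reconstruction, clean)
loss.backward()
```

In a downstream setting, it is the encoder's hidden states (not the reconstruction head) that would be extracted as speech representations or fine-tuned with a downstream model.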
Related papers
- Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models [6.809572275782338]
We develop a unified signal propagation theory and provide formulae that govern the moments of the forward and backward signal through the transformer model.
Our framework can be used to understand and mitigate vanishing/exploding gradients, rank collapse, and instability associated with high attention scores.
arXiv Detail & Related papers (2024-03-14T17:59:14Z)
- Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
arXiv Detail & Related papers (2023-10-25T03:40:50Z)
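For readers unfamiliar with Flow Matching, the snippet below sketches a generic conditional flow-matching training step with a masked condition, in the spirit of the SpeechFlow summary above; it is not the SpeechFlow implementation, and the network, shapes, and masking ratio are assumptions made for illustration.

```python
# Generic flow-matching training step with a masked condition (illustrative).
# This is not the SpeechFlow code; names, shapes, and the simple MLP are
# assumptions made for the sketch.
import torch
import torch.nn as nn

feat_dim, hidden = 80, 256
velocity_net = nn.Sequential(                 # predicts velocity d x_t / d t
    nn.Linear(2 * feat_dim + 1, hidden), nn.ReLU(),
    nn.Linear(hidden, feat_dim))

x1 = torch.randn(16, feat_dim)                # "data": speech feature frames
x0 = torch.randn_like(x1)                     # noise sample
t = torch.rand(16, 1)                         # random time in [0, 1]
xt = (1 - t) * x0 + t * x1                    # point on the interpolation path
target_velocity = x1 - x0                     # velocity of that path

# Masked condition: the model sees the data with random entries dropped.
mask = (torch.rand_like(x1) < 0.5).float()
condition = mask * x1

pred = velocity_net(torch.cat([xt, condition, t], dim=-1))
loss = ((pred - target_velocity) ** 2).mean() # flow-matching regression loss
loss.backward()
```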
- Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement [19.632358491434697]
Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning.
In this paper, we investigate the feasibility of using pre-trained speech representation models for a downstream speech enhancement task.
Our proposed method enables significant improvements in speech quality compared to baselines when combined with various types of pre-trained speech models.
arXiv Detail & Related papers (2023-06-14T10:03:33Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Transformers as Neural Augmentors: Class Conditional Sentence Generation via Variational Bayes [0.0]
We propose a neural data augmentation method, which is a combination of Variational Autoencoder and encoder-decoder Transformer model.
While encoding and decoding the input sentence, our model captures the syntactic and semantic representation of the input language with its class condition.
Our model increases the performance of current models compared to other data augmentation techniques with a small amount of computation power.
arXiv Detail & Related papers (2022-05-19T08:42:33Z)
- Entropy optimized semi-supervised decomposed vector-quantized variational autoencoder model based on transfer learning for multiclass text classification and generation [3.9318191265352196]
We propose a semi-supervised discrete latent variable model for multi-class text classification and text generation.
The proposed model employs the concept of transfer learning for training a quantized transformer model.
Experimental results indicate that the proposed model surpasses state-of-the-art models by a notable margin.
arXiv Detail & Related papers (2021-11-10T07:07:54Z)
- How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformer-based pretrained language models achieve outstanding results on many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in both training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
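As an illustration of the contrastive idea described in the Long-Short Temporal Contrastive Learning summary above, the sketch below computes an InfoNCE loss that matches each short-clip embedding to the embedding of a longer clip from the same video; the random tensors stand in for encoder outputs, and the batch size, dimension, and temperature are placeholder assumptions.

```python
# Sketch of a long/short-clip contrastive objective (InfoNCE), illustrating
# the idea of matching a short clip's embedding to the embedding of a longer
# clip from the same video. The random embeddings stand in for the outputs
# of two video-transformer encoders; batch size, dimension, and temperature
# are placeholder assumptions.
import torch
import torch.nn.functional as F

batch, dim, temperature = 8, 128, 0.07
short_emb = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)
long_emb = F.normalize(torch.randn(batch, dim, requires_grad=True), dim=-1)

# Similarity of every short clip against every long clip in the batch;
# the matching pair (the diagonal) is the positive, the rest are negatives.
logits = short_emb @ long_emb.t() / temperature
targets = torch.arange(batch)
loss = F.cross_entropy(logits, targets)
loss.backward()
```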
- ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)
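To illustrate the token extraction described in the ViViT summary, the following sketch slices a video into non-overlapping spatio-temporal patches with a strided 3D convolution and feeds the resulting tokens to a Transformer encoder; the patch size, model width, and depth are assumptions for illustration, not the paper's configuration.

```python
# Sketch of extracting spatio-temporal "tubelet" tokens from a video and
# feeding them to a transformer encoder. Patch sizes and dimensions are
# assumptions for illustration, not the ViViT configuration.
import torch
import torch.nn as nn

video = torch.randn(2, 3, 16, 64, 64)          # (batch, channels, T, H, W)
d_model, tubelet = 192, (2, 16, 16)            # temporal x spatial patch size

# A 3D convolution with stride == kernel size slices the video into
# non-overlapping tubelets and linearly projects each one to a token.
to_tokens = nn.Conv3d(3, d_model, kernel_size=tubelet, stride=tubelet)
tokens = to_tokens(video)                      # (2, d_model, 8, 4, 4)
tokens = tokens.flatten(2).transpose(1, 2)     # (2, 128 tokens, d_model)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2)
encoded = encoder(tokens)                      # clip-level token representations
print(encoded.shape)                           # torch.Size([2, 128, 192])
```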
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.