Pushing the Limits of Unsupervised Unit Discovery for SSL Speech
Representation
- URL: http://arxiv.org/abs/2306.08920v1
- Date: Thu, 15 Jun 2023 07:45:12 GMT
- Title: Pushing the Limits of Unsupervised Unit Discovery for SSL Speech
Representation
- Authors: Ziyang Ma, Zhisheng Zheng, Guanrou Yang, Yu Wang, Chao Zhang, Xie Chen
- Abstract summary: HuBERT is a successful example that utilizes offline clustering to convert speech features into discrete units for a masked language modeling pretext task.
We present an unsupervised method to improve SSL targets.
Two models are proposed, MonoBERT and PolyBERT, which leverage context-independent and context-dependent phoneme-based units for pre-training.
- Score: 12.506633315768832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The excellent generalization ability of self-supervised learning (SSL) for
speech foundation models has garnered significant attention. HuBERT is a
successful example that utilizes offline clustering to convert speech features
into discrete units for a masked language modeling pretext task. However,
simply clustering features as targets with k-means does not fully exploit the
model's potential. In this work, we present an unsupervised method to improve
SSL targets. Two models are proposed, MonoBERT and PolyBERT, which leverage
context-independent and context-dependent phoneme-based units for pre-training.
Our models outperform other SSL models significantly on the LibriSpeech
benchmark without the need for iterative re-clustering and re-training.
Furthermore, our models equipped with context-dependent units even outperform
target-improvement models that use labeled data during pre-training. How we
progressively improve the unit discovery process is demonstrated through
experiments.
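The offline clustering step that HuBERT builds on, and that MonoBERT and PolyBERT aim to improve, can be illustrated with a minimal sketch: run k-means over frame-level speech features and use the cluster indices as discrete pretraining targets. The feature dimensionality, cluster count, and random stand-in features below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of HuBERT-style offline target generation: cluster
# frame-level speech features with k-means and use the cluster indices
# as discrete units for the masked-prediction pretext task.
# Feature dimensionality, cluster count, and the random stand-in
# features are illustrative assumptions, not the paper's exact setup.
import numpy as np
from sklearn.cluster import KMeans

n_frames, feat_dim, n_units = 10000, 39, 100   # e.g. MFCC-like frames

# Stand-in for real frame-level features (MFCCs or SSL hidden states).
features = np.random.randn(n_frames, feat_dim).astype(np.float32)

# Offline clustering over the corpus features.
kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)

# Each frame's cluster index becomes its discrete pretraining unit.
units = kmeans.predict(features)               # shape: (n_frames,)
print(units[:20])
```

MonoBERT and PolyBERT keep this discrete-unit interface but replace the plain k-means units with unsupervised context-independent and context-dependent phoneme-based units.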
Related papers
- ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets [106.7760874400261]
This paper presents ML-SUPERB 2.0, a new benchmark for evaluating pre-trained SSL and supervised speech models.
We find performance improvements over the setup of ML-SUPERB, but performance depends on the downstream model design.
Also, we find large performance differences between languages and datasets, suggesting the need for more targeted approaches.
arXiv Detail & Related papers (2024-06-12T21:01:26Z)
- Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations [30.293081541301746]
Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition.
We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective.
Our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.
arXiv Detail & Related papers (2023-05-14T08:26:24Z)
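The summary above attributes the limitation to missing disentanglement and the lack of an utterance-level learning objective. Purely as an illustration (not the paper's neural factor analysis model), the sketch below pairs a frame-level unit-prediction loss with an utterance-level loss on mean-pooled features; the toy encoder, the speaker-ID proxy objective, and all sizes are assumptions.

```python
# Purely illustrative sketch of adding an utterance-level objective next
# to a frame-level SSL loss; this is NOT the paper's neural factor
# analysis model. The toy encoder, the speaker-ID proxy task, and all
# sizes are assumptions.
import torch
import torch.nn as nn

T, feat_dim, hid_dim, n_units, n_speakers = 50, 80, 256, 100, 20

encoder = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.ReLU())
frame_head = nn.Linear(hid_dim, n_units)       # frame-level unit prediction
utt_head = nn.Linear(hid_dim, n_speakers)      # utterance-level objective

x = torch.randn(T, feat_dim)                   # one utterance of frames
frame_targets = torch.randint(0, n_units, (T,))
speaker_id = torch.tensor([3])                 # utterance-level label

h = encoder(x)
loss_frame = nn.functional.cross_entropy(frame_head(h), frame_targets)
pooled = h.mean(dim=0, keepdim=True)           # utterance-level representation
loss_utt = nn.functional.cross_entropy(utt_head(pooled), speaker_id)
(loss_frame + loss_utt).backward()
```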
- Model Extraction Attack against Self-supervised Speech Models [52.81330435990717]
Self-supervised learning (SSL) speech models generate meaningful representations of given clips.
Model extraction attack (MEA) often refers to an adversary stealing the functionality of the victim model with only query access.
We study the MEA problem against SSL speech models with a small number of queries.
arXiv Detail & Related papers (2022-11-29T09:28:05Z)
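To make the MEA setting above concrete, here is a hedged sketch of the general pattern: query a black-box victim representation model on a small set of clips and train a student to imitate its outputs. The tiny architectures, query budget, and loss are illustrative assumptions, not the paper's attack.

```python
# Illustrative sketch of a model extraction attack against a
# representation model: query the black-box victim on a small set of
# inputs and train a student to mimic its outputs. Architectures,
# query budget, and data are assumptions for illustration only.
import torch
import torch.nn as nn

feat_dim, hid_dim, n_queries = 80, 256, 512

victim = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.ReLU(),
                       nn.Linear(hid_dim, hid_dim)).eval()   # black box
student = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.ReLU(),
                        nn.Linear(hid_dim, hid_dim))

queries = torch.randn(n_queries, feat_dim)                   # attacker's clips
with torch.no_grad():                                        # query access only
    targets = victim(queries)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.mse_loss(student(queries), targets)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final imitation loss: {loss.item():.4f}")
```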
- MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets [6.238268985570237]
We provide a new perspective on self-supervised speech models based on how their training targets are obtained.
We propose a new multi-task learning framework for self-supervised learning, MT4SSL.
Our model outperforms previous SSL methods by nontrivial margins on the LibriSpeech benchmark.
arXiv Detail & Related papers (2022-11-14T13:00:47Z)
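The MT4SSL summary above centers on integrating multiple training targets. Below is a hedged sketch of that general idea, combining a cross-entropy loss on offline discrete units with a regression loss on online teacher features; the toy encoder, the target sources, and the loss weighting are assumptions.

```python
# Hedged sketch of multi-target SSL in the spirit of MT4SSL: combine a
# cross-entropy loss against offline discrete units with a regression
# loss against online teacher features. All shapes, weights, and the
# tiny encoder are illustrative assumptions.
import torch
import torch.nn as nn

T, feat_dim, hid_dim, n_units = 50, 80, 256, 100

encoder = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.ReLU())
unit_head = nn.Linear(hid_dim, n_units)     # predicts offline k-means units
feat_head = nn.Linear(hid_dim, hid_dim)     # regresses online teacher features

x = torch.randn(T, feat_dim)                             # one utterance of frames
offline_units = torch.randint(0, n_units, (T,))          # from offline clustering
online_targets = torch.randn(T, hid_dim)                 # from an online teacher

h = encoder(x)
loss_offline = nn.functional.cross_entropy(unit_head(h), offline_units)
loss_online = nn.functional.mse_loss(feat_head(h), online_targets)
loss = loss_offline + 1.0 * loss_online                  # weighting is a guess
loss.backward()
```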
- Towards Sustainable Self-supervised Learning [193.78876000005366]
We propose a Target-Enhanced Conditional (TEC) scheme that introduces two components into existing mask-reconstruction-based SSL.
First, we propose patch-relation enhanced targets, which enhance the targets given by the base model and encourage the new model to learn semantic-relation knowledge from it.
Second, we introduce a conditional adapter that adaptively adjusts the new model's prediction to align with the targets of different base models.
arXiv Detail & Related papers (2022-10-20T04:49:56Z)
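The TEC summary mentions a conditional adapter that aligns the new model's prediction with the targets of different base models. Here is a minimal sketch of that idea, with one small adapter per base model; the dimensions and stand-in targets are assumptions.

```python
# Minimal sketch of a conditional adapter in the spirit of TEC: a small
# per-base-model head maps the new model's prediction into the target
# space of the chosen base model. Sizes and the two stand-in base-model
# targets are illustrative assumptions.
import torch
import torch.nn as nn

hid_dim, n_base_models = 256, 2
prediction = torch.randn(8, hid_dim)                 # new model's output (batch of 8)
base_targets = [torch.randn(8, hid_dim) for _ in range(n_base_models)]

adapters = nn.ModuleList(nn.Linear(hid_dim, hid_dim) for _ in range(n_base_models))

# Each adapter aligns the shared prediction with its base model's target.
loss = sum(
    nn.functional.mse_loss(adapters[i](prediction), base_targets[i])
    for i in range(n_base_models)
)
loss.backward()
```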
- Efficient Gaussian Process Model on Class-Imbalanced Datasets for Generalized Zero-Shot Learning [37.00463358780726]
We propose a Neural Network model that learns a latent feature embedding and a Gaussian Process (GP) regression model that predicts latent feature prototypes of unseen classes.
Our model is trained efficiently with a simple training strategy that mitigates the impact of class-imbalanced training data.
arXiv Detail & Related papers (2022-10-11T04:57:20Z)
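A hedged sketch of the GP component described above: regress from class-level semantic attributes to latent feature prototypes of seen classes, then predict prototypes for unseen classes. Attribute and feature dimensions and the random data are assumptions.

```python
# Hedged sketch of the GP step described above: fit a Gaussian Process
# from class-level attribute vectors to latent feature prototypes of
# seen classes, then predict prototypes for unseen classes.
# Attribute/feature dimensions and the random data are assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

n_seen, n_unseen, attr_dim, latent_dim = 40, 10, 85, 64

seen_attrs = np.random.rand(n_seen, attr_dim)         # semantic descriptions
seen_protos = np.random.randn(n_seen, latent_dim)     # mean latent features per class
unseen_attrs = np.random.rand(n_unseen, attr_dim)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(seen_attrs, seen_protos)
unseen_protos = gp.predict(unseen_attrs)              # prototypes for unseen classes
print(unseen_protos.shape)                            # (10, 64)
```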
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
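METRO's summary centers on training signals produced by an auxiliary model. One common instantiation of that pattern is ELECTRA-style replaced-token detection, sketched below purely as an illustration rather than METRO's exact objective; vocabulary size, dimensions, and masking rate are assumptions.

```python
# Illustrative sketch of pretraining with model-generated signals in the
# ELECTRA style: an auxiliary generator fills in masked tokens, and the
# main model learns to detect which tokens were replaced. This is an
# illustration of the general pattern, not METRO's exact objective;
# vocabulary size, dimensions, and masking rate are assumptions.
import torch
import torch.nn as nn

vocab, seq_len, dim = 1000, 32, 128

generator = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
main_model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, 1))

tokens = torch.randint(0, vocab, (seq_len,))
mask = torch.rand(seq_len) < 0.15                     # mask ~15% of positions

with torch.no_grad():                                 # auxiliary model generates signals
    logits = generator(tokens)
    sampled = torch.distributions.Categorical(logits=logits).sample()
corrupted = torch.where(mask, sampled, tokens)

# Main model is trained to tell original from replaced tokens.
is_replaced = (corrupted != tokens).float()
pred = main_model(corrupted).squeeze(-1)
loss = nn.functional.binary_cross_entropy_with_logits(pred, is_replaced)
loss.backward()
```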
- A Gating Model for Bias Calibration in Generalized Zero-shot Learning [18.32369721322249]
Generalized zero-shot learning (GZSL) aims at training a model that can generalize to unseen class data by only using auxiliary information.
One of the main challenges in GZSL is model prediction biased toward seen classes, caused by overfitting on the seen-class data that is the only data available during training.
We propose a two-stream autoencoder-based gating model for GZSL.
arXiv Detail & Related papers (2022-03-08T16:41:06Z)
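A simplified sketch of the gating idea for GZSL: a gate estimates whether a sample belongs to a seen or an unseen class and routes it to the corresponding expert, mitigating the seen-class bias. The plain MLP gate below is a stand-in, not the paper's two-stream autoencoder; all sizes are assumptions.

```python
# Simplified sketch of gating for GZSL: a gate estimates whether a
# sample comes from a seen or an unseen class and routes it to the
# corresponding expert classifier. The gate here is a plain MLP
# stand-in, not the paper's two-stream autoencoder; sizes are assumptions.
import torch
import torch.nn as nn

feat_dim, n_seen, n_unseen = 64, 10, 5
x = torch.randn(1, feat_dim)                                 # one test sample

gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())   # P(seen class)
seen_clf = nn.Linear(feat_dim, n_seen)
unseen_clf = nn.Linear(feat_dim, n_unseen)

p_seen = gate(x)
if p_seen.item() > 0.5:
    pred = seen_clf(x).argmax(dim=-1)              # index into seen classes
else:
    pred = unseen_clf(x).argmax(dim=-1)            # index into unseen classes
print(p_seen.item(), pred.item())
```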
- Self-Supervised Learning for speech recognition with Intermediate layer supervision [52.93758711230248]
We propose Intermediate Layer Supervision for Self-Supervised Learning (ILS-SSL).
ILS-SSL forces the model to concentrate on content information as much as possible by adding an additional SSL loss on the intermediate layers.
Experiments on LibriSpeech test-other set show that our method outperforms HuBERT significantly.
arXiv Detail & Related papers (2021-12-16T10:45:05Z)
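The ILS-SSL summary boils down to applying the SSL loss at intermediate layers as well as the top layer. A hedged sketch of that idea follows; the toy encoder, the chosen layers, and the discrete targets are illustrative assumptions.

```python
# Hedged sketch of intermediate-layer supervision: apply the SSL
# prediction loss at intermediate encoder layers as well as the top
# layer, pushing lower layers toward content information. The toy
# encoder, chosen layers, and targets are illustrative assumptions.
import torch
import torch.nn as nn

T, dim, n_units = 50, 128, 100
layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(6))
head = nn.Linear(dim, n_units)                     # shared unit-prediction head
supervised_layers = {3, 6}                         # intermediate + final layer

x = torch.randn(T, dim)
targets = torch.randint(0, n_units, (T,))          # discrete units (e.g. k-means)

loss, h = 0.0, x
for i, layer in enumerate(layers, start=1):
    h = torch.relu(layer(h))
    if i in supervised_layers:
        loss = loss + nn.functional.cross_entropy(head(h), targets)
loss.backward()
```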
- SUPERB: Speech processing Universal PERformance Benchmark [78.41287216481203]
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV).
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.
We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.
arXiv Detail & Related papers (2021-05-03T17:51:09Z)
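The SUPERB recipe described above keeps the shared SSL model frozen and trains only lightweight task-specific heads. A minimal sketch under that assumption; the tiny stand-in backbone, head, and task data are illustrative.

```python
# Hedged sketch of the SUPERB-style recipe: keep the shared SSL model
# frozen and train only a lightweight task-specific prediction head on
# top of its features. The tiny stand-in backbone, head, and task data
# are illustrative assumptions.
import torch
import torch.nn as nn

T, feat_dim, hid_dim, n_classes = 100, 80, 256, 10

backbone = nn.Sequential(nn.Linear(feat_dim, hid_dim), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False                        # frozen shared model

head = nn.Linear(hid_dim, n_classes)               # lightweight prediction head
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

x = torch.randn(T, feat_dim)
y = torch.randint(0, n_classes, (T,))

with torch.no_grad():
    feats = backbone(x)
loss = nn.functional.cross_entropy(head(feats), y)
opt.zero_grad(); loss.backward(); opt.step()
```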