General-Purpose Speech Representation Learning through a Self-Supervised
Multi-Granularity Framework
- URL: http://arxiv.org/abs/2102.01930v1
- Date: Wed, 3 Feb 2021 08:13:21 GMT
- Title: General-Purpose Speech Representation Learning through a Self-Supervised
Multi-Granularity Framework
- Authors: Yucheng Zhao, Dacheng Yin, Chong Luo, Zhiyuan Zhao, Chuanxin Tang,
Wenjun Zeng, Zheng-Jun Zha
- Abstract summary: This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
- Score: 114.63823178097402
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents a self-supervised learning framework, named MGF, for
general-purpose speech representation learning. In the design of MGF, speech
hierarchy is taken into consideration. Specifically, we propose to use
generative learning approaches to capture fine-grained information at small
time scales and use discriminative learning approaches to distill
coarse-grained or semantic information at large time scales. For phoneme-scale
learning, we borrow the idea of the masked language model but tailor it to the
continuous speech signal by replacing the classification loss with a
contrastive loss. We corroborate our design by evaluating the MGF
representation on various downstream tasks, including phoneme classification,
speaker classification, speech recognition, and emotion classification.
Experiments verify that training at different time scales requires different
training targets and loss functions, which in general complement each other
and lead to better performance.
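To make the phoneme-scale objective concrete, the following is a minimal sketch, assuming a generic frame encoder: continuous frame features are masked, the masked sequence is encoded, and an InfoNCE-style contrastive loss asks each masked position to identify its true frame among sampled distractors. The function and parameter names (masked_contrastive_loss, num_negatives) are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(encoder, frames, mask, num_negatives=10, temperature=0.1):
    """InfoNCE-style masked prediction on continuous speech frames.

    frames: (batch, time, dim) continuous frame features (the targets).
    mask:   (batch, time) bool tensor, True at masked positions.
    """
    # Zero out masked frames before encoding (a learned mask embedding
    # would typically be used in practice).
    masked_input = frames.masked_fill(mask.unsqueeze(-1), 0.0)
    preds = encoder(masked_input)        # (batch, time, dim)

    pred = preds[mask]                   # (n_masked, dim) predictions
    target = frames[mask]                # (n_masked, dim) positives

    n = pred.size(0)
    # Distractors are other masked frames from the same batch; a real
    # implementation would exclude accidental positives.
    neg_idx = torch.randint(0, n, (n, num_negatives), device=frames.device)
    candidates = torch.cat([target.unsqueeze(1), target[neg_idx]], dim=1)

    # Each prediction is scored against its K+1 candidates; the true
    # frame sits at index 0, so the label is always 0.
    logits = F.cosine_similarity(pred.unsqueeze(1), candidates, dim=-1) / temperature
    labels = torch.zeros(n, dtype=torch.long, device=frames.device)
    return F.cross_entropy(logits, labels)
```

The contrastive scoring stands in for the classification loss of a text masked language model: with no discrete vocabulary over the continuous signal, the positive target is the frame itself.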
Related papers
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets
from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
- MASR: Multi-label Aware Speech Representation [36.2978180342839]
We propose MASR, a Multi-label Aware Speech Representation learning framework.
MASR enables the inclusion of multiple external knowledge sources to enhance the utilization of meta-data information.
We show significant performance improvements for MASR over other established benchmarks.
arXiv Detail & Related papers (2023-07-20T16:09:57Z)
- SLICER: Learning universal audio representations using low-resource self-supervised pre-training [53.06337011259031]
We present a new Self-Supervised Learning approach to pre-train encoders on unlabeled audio data.
Our primary aim is to learn audio representations that can generalize across a large variety of speech and non-speech tasks.
arXiv Detail & Related papers (2022-11-02T23:45:33Z)
- A Simple Meta-learning Paradigm for Zero-shot Intent Classification with Mixture Attention Mechanism [17.228616743739412]
We propose a simple yet effective meta-learning paradigm for zero-shot intent classification.
To learn better semantic representations for utterances, we introduce a new mixture attention mechanism.
To strengthen the transfer ability of the model from seen classes to unseen classes, we reformulate zero-shot intent classification with a meta-learning strategy.
arXiv Detail & Related papers (2022-06-05T13:37:51Z)
- data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language [85.9019051663368]
data2vec is a framework that uses the same learning method for speech, NLP, and computer vision.
The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup (see the sketch after this list).
Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance.
arXiv Detail & Related papers (2022-02-07T22:52:11Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- A Framework to Enhance Generalization of Deep Metric Learning methods using General Discriminative Feature Learning and Class Adversarial Neural Networks [1.5469452301122175]
Metric learning algorithms aim to learn a distance function that brings semantically similar data items together and keeps dissimilar ones at a distance.
Deep Metric Learning (DML) methods automatically extract features from data and learn a non-linear transformation from the input space to a semantic embedding space.
We propose a framework to enhance the generalization power of existing DML methods in a Zero-Shot Learning (ZSL) setting.
arXiv Detail & Related papers (2021-06-11T14:24:40Z)
- Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning [20.39971017940006]
Speech SimCLR is a new self-supervised objective for speech representation learning.
During training, Speech SimCLR applies augmentation to raw speech and its spectrogram.
arXiv Detail & Related papers (2020-10-27T02:09:06Z)
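Returning to the data2vec entry above, here is a minimal sketch of its masked-view self-distillation, under two simplifying assumptions: a single teacher forward pass yields the regression targets (the paper averages several top layers), and the teacher is an exponential-moving-average copy of the student. All names are illustrative.

```python
import copy
import torch
import torch.nn.functional as F

def make_teacher(student):
    """The teacher starts as a frozen deep copy of the student."""
    return copy.deepcopy(student).requires_grad_(False)

def self_distillation_step(student, teacher, frames, mask, ema_decay=0.999):
    # The teacher sees the full input and provides continuous latent targets.
    with torch.no_grad():
        targets = teacher(frames)

    # The student sees only the masked view.
    masked_view = frames.masked_fill(mask.unsqueeze(-1), 0.0)
    preds = student(masked_view)

    # Regress teacher latents at the masked time steps.
    loss = F.smooth_l1_loss(preds[mask], targets[mask])

    # Move the teacher a small step toward the student (EMA update).
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema_decay).add_(ps, alpha=1.0 - ema_decay)
    return loss
```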