The Ability of Self-Supervised Speech Models for Audio Representations
- URL: http://arxiv.org/abs/2209.12900v2
- Date: Wed, 28 Sep 2022 03:39:56 GMT
- Title: The Ability of Self-Supervised Speech Models for Audio Representations
- Authors: Tung-Yu Wu, Chen-An Li, Tzu-Han Lin, Tsu-Yuan Hsu, Hung-Yi Lee
- Abstract summary: Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on numerous speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may fail on certain types of datasets.
- Score: 53.19715501273934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised learning (SSL) speech models have achieved unprecedented
success in speech representation learning, but some questions regarding their
representation ability remain unanswered. This paper addresses two of them:
(1) Can SSL speech models deal with non-speech audio? (2) Do different SSL
speech models have insights into different aspects of audio features? To answer
the two questions, we conduct extensive experiments on numerous speech and
non-speech audio datasets to evaluate the representation ability of two
state-of-the-art SSL speech models, wav2vec 2.0 and HuBERT. The experiments
were carried out during the NeurIPS 2021 HEAR Challenge, using the standard
evaluation pipeline provided by the challenge organizers. Results show
that (1) SSL speech models can extract meaningful features from a wide range
of non-speech audio, though they may fail on certain types of datasets; (2)
different SSL speech models have insights into different aspects of audio
features. These two findings provide a foundation for ensembling
representation models. We further propose an ensemble framework that fuses the
embeddings of multiple speech representation models. Our framework outperforms
state-of-the-art SSL speech/audio models and generally surpasses other teams'
HEAR Challenge submissions across numerous datasets. Our code is available at
https://github.com/tony10101105/HEAR-2021-NeurIPS-Challenge -- NTU-GURA.
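As a rough illustration of this fusion idea, the sketch below extracts clip-level embeddings from wav2vec 2.0 and HuBERT with the Hugging Face transformers library and concatenates them. The checkpoint names and the mean-pooling step are assumptions made for the example, not the authors' exact HEAR submission; see the repository above for the actual implementation.

```python
# A minimal sketch of concatenation-based embedding fusion, assuming the
# Hugging Face checkpoints below; the paper's actual HEAR submission may differ.
import torch
from transformers import AutoFeatureExtractor, AutoModel

WAV2VEC2 = "facebook/wav2vec2-base"    # assumed checkpoint
HUBERT = "facebook/hubert-base-ls960"  # assumed checkpoint

def scene_embedding(model_name: str, waveform: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    """One fixed-size vector per clip: mean-pool the hidden states over time."""
    extractor = AutoFeatureExtractor.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    inputs = extractor(waveform.numpy(), sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)

def fused_embedding(waveform: torch.Tensor) -> torch.Tensor:
    """Fuse the two models by concatenating their clip-level embeddings."""
    return torch.cat([scene_embedding(WAV2VEC2, waveform),
                      scene_embedding(HUBERT, waveform)])

# One second of dummy 16 kHz audio -> a 768 + 768 = 1536-dimensional vector.
print(fused_embedding(torch.randn(16000)).shape)  # torch.Size([1536])
```

A downstream classifier trained on the fused vector can then draw on whichever aspects of the audio each model captures better.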
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- What Do Self-Supervised Speech and Speaker Models Learn? New Findings From a Cross Model Layer-Wise Analysis [44.93152068353389]
Self-supervised learning (SSL) has attracted increased attention for learning meaningful speech representations.
Speaker SSL models adopt utterance-level training objectives primarily for speaker representation.
arXiv Detail & Related papers (2024-01-31T07:23:22Z)
- Do self-supervised speech and language models extract similar representations as human brain? [2.390915090736061]
Speech and language models trained through self-supervised learning (SSL) demonstrate strong alignment with brain activity during speech and language perception.
We evaluate the brain prediction performance of two representative SSL models, Wav2Vec2.0 and GPT-2.
arXiv Detail & Related papers (2023-10-07T01:39:56Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- SpeechGLUE: How Well Can Self-Supervised Speech Models Capture Linguistic Knowledge? [45.901645659694935]
Self-supervised learning (SSL) for speech representation has been successfully applied in various downstream tasks.
In this paper, we aim to clarify whether speech SSL techniques can capture linguistic knowledge well.
arXiv Detail & Related papers (2023-06-14T09:04:29Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with those much larger models on downstream tasks.
In probing experiments, we find that the latent representations encode richer phoneme and speaker information than the last layer does (a layer-wise extraction sketch follows this list).
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
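The layer-wise probing idea above can be checked in spirit with any SSL speech model that exposes per-layer hidden states. Below is a minimal sketch, assuming wav2vec 2.0 as a stand-in for Audio ALBERT and mean-pooling over time; the probe classifier itself is omitted.

```python
# A hedged sketch of layer-wise probing: collect every hidden layer so a simple
# linear probe could compare intermediate layers against the final one.
# wav2vec 2.0 stands in for Audio ALBERT; model choice and pooling are assumptions.
import torch
from transformers import AutoFeatureExtractor, AutoModel

name = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

waveform = torch.randn(16000)  # dummy 1 s clip at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: one (1, frames, dim) tensor per layer, plus the CNN features.
for i, h in enumerate(out.hidden_states):
    pooled = h.mean(dim=1)  # mean-pool over time; this vector would feed the probe
    print(f"layer {i}: pooled shape {tuple(pooled.shape)}")
```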
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.