Learning Invariant Representation and Risk Minimized for Unsupervised
Accent Domain Adaptation
- URL: http://arxiv.org/abs/2210.08182v1
- Date: Sat, 15 Oct 2022 03:56:31 GMT
- Title: Learning Invariant Representation and Risk Minimized for Unsupervised
Accent Domain Adaptation
- Authors: Chendong Zhao, Jianzong Wang, Xiaoyang Qu, Haoqian Wang, Jing Xiao
- Abstract summary: Unsupervised representation learning for speech audio has attained impressive performance on speech recognition tasks.
In this work, we explore learning domain-invariant representations via a direct mapping of speech representations to their corresponding high-level linguistic information.
Results show that the learned latents not only capture the articulatory features of each phoneme but also enhance adaptation ability, outperforming the baseline by a large margin on accented benchmarks.
- Score: 32.75866643254402
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unsupervised representation learning for speech audio has attained
impressive performance on speech recognition tasks, particularly when annotated
speech is limited. However, the unsupervised paradigm needs to be carefully
designed, and little is known about what properties these representations
acquire. There is no guarantee that the model learns representations that are
meaningful for recognition. Moreover, the ability of the learned
representations to adapt to other domains still needs to be assessed. In this
work, we explore learning domain-invariant representations via a direct mapping
of speech representations to their corresponding high-level linguistic
information. Results show that the learned latents not only capture the
articulatory features of each phoneme but also enhance adaptation ability,
outperforming the baseline by a large margin on accented benchmarks.
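The abstract's central idea is a direct mapping from learned speech representations to high-level linguistic units such as phonemes. Below is a minimal PyTorch sketch of such a frame-level mapping; it is not the authors' implementation, and the feature dimension, phoneme inventory, and head architecture are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class PhonemeMapper(nn.Module):
    """Maps frame-level speech representations to phoneme posteriors."""
    def __init__(self, feat_dim: int = 768, num_phonemes: int = 44):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_phonemes),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, feat_dim) from a frozen upstream encoder
        return self.proj(feats)

model = PhonemeMapper()
criterion = nn.CrossEntropyLoss()

feats = torch.randn(2, 100, 768)          # stand-in SSL features
labels = torch.randint(0, 44, (2, 100))   # frame-level phoneme labels

logits = model(feats)                     # (2, 100, 44)
loss = criterion(logits.reshape(-1, 44), labels.reshape(-1))
loss.backward()
```

Training such a head on source-domain phoneme labels is one plausible way to tie the latent space to articulatory content; the domain-invariance machinery of the paper itself is not reproduced here.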
Related papers
- A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech
Enhancement [16.900731393703648]
Self-supervised learning (SSL) models have been found to be very effective for certain speech tasks.
In this paper, we investigate the use of SSL representations for single-channel speech enhancement in challenging conditions.
arXiv Detail & Related papers (2024-03-03T02:05:17Z)
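To make the Wav2Vec2-for-enhancement setup above concrete, here is a hedged sketch that feeds frozen wav2vec 2.0 features to a small gain-predicting head. The mask head and its dimensions are invented for illustration; only the `transformers` model loading reflects a real API:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

ssl = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
ssl.eval()

class MaskHead(nn.Module):
    """Toy head that predicts a per-frame gain from SSL features."""
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)           # (batch, frames, 1), gain in [0, 1]

wave = torch.randn(1, 16000)             # one second of 16 kHz audio
with torch.no_grad():
    feats = ssl(wave).last_hidden_state  # (1, frames, 768)
mask = MaskHead()(feats)                 # frame-wise gains for enhancement
```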
- SememeASR: Boosting Performance of End-to-End Speech Recognition against Domain and Long-Tailed Data Shift with Sememe Semantic Knowledge [58.979490858061745]
We introduce sememe-based semantic knowledge into speech recognition.
Our experiments show that sememe information can improve the effectiveness of speech recognition.
In addition, our further experiments show that sememe knowledge can improve the model's recognition of long-tailed data.
arXiv Detail & Related papers (2023-09-04T08:35:05Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
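One plausible (but speculative, not taken from SPADE) way to encourage acoustic disentanglement with self-supervision is a pretext task that classifies whether two segments share the same reverberation condition:

```python
import torch
import torch.nn as nn

# Shared encoder over pooled segment features (dimensions are invented).
encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))
same_room = nn.Linear(128, 2)       # binary: same acoustic condition or not

seg_a = torch.randn(8, 80)          # pooled features of segment A
seg_b = torch.randn(8, 80)          # pooled features of segment B
labels = torch.randint(0, 2, (8,))  # 1 if both segments share a room

pair = torch.cat([encoder(seg_a), encoder(seg_b)], dim=-1)  # (8, 128)
loss = nn.functional.cross_entropy(same_room(pair), labels)
```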
- Supervised Acoustic Embeddings And Their Transferability Across Languages [2.28438857884398]
In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise.
Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition.
arXiv Detail & Related papers (2023-01-03T09:37:24Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
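A toy sketch of the joint-modeling idea above: a shared encoder splits its output into a transcript-relevant latent consumed by an ASR head and a residual latent consumed only by a reconstruction (TTS-like) decoder. All module shapes are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class SplitEncoder(nn.Module):
    def __init__(self, in_dim=80, hid=256, z_dim=128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hid, batch_first=True)
        self.to_text = nn.Linear(hid, z_dim)  # transcription-relevant latent
        self.to_rest = nn.Linear(hid, z_dim)  # everything else (speaker, channel)

    def forward(self, x):
        h, _ = self.rnn(x)
        return self.to_text(h), self.to_rest(h)

enc = SplitEncoder()
asr_head = nn.Linear(128, 30)    # e.g. a character vocabulary
recon_head = nn.Linear(256, 80)  # decoder consumes both latents

mels = torch.randn(2, 200, 80)
z_text, z_rest = enc(mels)
asr_logits = asr_head(z_text)                        # used for the ASR loss
recon = recon_head(torch.cat([z_text, z_rest], -1))  # used for the TTS loss
```

Because only `z_text` feeds the ASR head while reconstruction needs both latents, non-transcript information has an incentive to flow into `z_rest`, which is the disentanglement effect the blurb describes.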
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Towards End-to-end Unsupervised Speech Recognition [120.4915001021405]
We introduce wav2vec-U 2.0, which does away with all audio-side pre-processing and improves accuracy through a better architecture.
In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input.
Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.
arXiv Detail & Related papers (2022-04-05T21:22:38Z)
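As a loose illustration of "tying predictions back to the input" (not the paper's actual objective), one can reconstruct the input features from the model's phoneme distribution and penalize the reconstruction error:

```python
import torch
import torch.nn as nn

feat_dim, num_phones = 512, 40
generator = nn.Linear(feat_dim, num_phones)    # unsupervised phoneme predictor
reconstruct = nn.Linear(num_phones, feat_dim)  # auxiliary reconstruction head

feats = torch.randn(4, 120, feat_dim)          # stand-in speech features
phone_probs = generator(feats).softmax(-1)     # model predictions
recon = reconstruct(phone_probs)               # map predictions back to input
aux_loss = nn.functional.mse_loss(recon, feats)
```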
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
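A generic sketch of the cooperative acoustic-linguistic idea behind Wav-BERT, assuming a simple cross-attention fusion rather than the paper's actual integration scheme; the two pretrained checkpoints are real, but the fusion layer and input text are assumptions:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel, BertTokenizer

acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
linguistic = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Assumed fusion: acoustic frames attend to BERT token states (both 768-dim).
fusion = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

wave = torch.randn(1, 16000)                    # one second at 16 kHz
tokens = tokenizer("a hypothesis transcript", return_tensors="pt")

with torch.no_grad():
    a = acoustic(wave).last_hidden_state        # (1, T_audio, 768)
    l = linguistic(**tokens).last_hidden_state  # (1, T_text, 768)

fused, _ = fusion(query=a, key=l, value=l)      # (1, T_audio, 768)
```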
- Learning De-identified Representations of Prosody from Raw Audio [7.025418443146435]
We propose a method for learning de-identified prosody representations from raw audio using a contrastive self-supervised signal.
We exploit the natural structure of prosody to minimize timbral information and decouple prosody from speaker representations.
arXiv Detail & Related papers (2021-07-17T14:37:25Z)
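An InfoNCE-style sketch of the contrastive idea above, with an invented toy prosody encoder over F0 contours and cheap augmentations standing in for the paper's actual self-supervised signal:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temp=0.1):
    """Match each view with its positive on the diagonal."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temp              # (batch, batch) similarities
    targets = torch.arange(z1.size(0))
    return F.cross_entropy(logits, targets)

encoder = torch.nn.GRU(1, 64, batch_first=True)  # toy prosody encoder over F0

f0 = torch.randn(8, 100, 1)               # batch of pitch contours
view1 = f0 + 0.05 * torch.randn_like(f0)  # cheap augmentation 1
view2 = f0 * 1.02                         # cheap augmentation 2
_, h1 = encoder(view1)
_, h2 = encoder(view2)
loss = info_nce(h1.squeeze(0), h2.squeeze(0))
```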
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy that derives speech representations which can flexibly address these issues through an attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free speech emotion recognition (SER) and better performance on emotionless speaker verification (SV).
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
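A loose sketch of the attribute-selection idea above (all names and block assignments assumed, not the LR-VAE itself): split a VAE latent into attribute-tied blocks and zero out the block a downstream task must not see:

```python
import torch
import torch.nn as nn

class BlockVAE(nn.Module):
    """VAE whose latent is split into attribute-tied blocks (toy shapes)."""
    def __init__(self, in_dim=80, block=16, n_blocks=3):
        super().__init__()
        z_dim = block * n_blocks
        self.enc_mu = nn.Linear(in_dim, z_dim)
        self.enc_logvar = nn.Linear(in_dim, z_dim)
        self.dec = nn.Linear(z_dim, in_dim)
        self.block = block

    def encode(self, x):
        # Standard reparameterization trick.
        mu, logvar = self.enc_mu(x), self.enc_logvar(x)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

vae = BlockVAE()
x = torch.randn(4, 80)
z = vae.encode(x)           # blocks: [identity | emotion | residual]
z_no_id = z.clone()
z_no_id[:, :vae.block] = 0  # attribute selection: drop the identity block
recon = vae.dec(z_no_id)    # identity-free representation for SER
```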