Unsupervised Domain Adaptation in Speech Recognition using Phonetic
Features
- URL: http://arxiv.org/abs/2108.02850v1
- Date: Wed, 4 Aug 2021 06:22:12 GMT
- Title: Unsupervised Domain Adaptation in Speech Recognition using Phonetic
Features
- Authors: Rupam Ojha, C Chandra Sekhar
- Abstract summary: We propose a technique to perform unsupervised gender-based domain adaptation in speech recognition using phonetic features.
Experiments are performed on the TIMIT dataset and there is a considerable decrease in the phoneme error rate using the proposed approach.
- Score: 6.872447420442981
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Automatic speech recognition is a difficult problem in pattern
recognition because several sources of variability exist in the speech input:
channel variations, clean versus noisy conditions, and differences in speaker
accent and gender. As a result, domain
adaptation is important in speech recognition where we train the model for a
particular source domain and test it on a different target domain. In this
paper, we propose a technique to perform unsupervised gender-based domain
adaptation in speech recognition using phonetic features. The experiments are
performed on the TIMIT dataset and there is a considerable decrease in the
phoneme error rate using the proposed approach.
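The reported metric, phoneme error rate (PER), is the Levenshtein (edit) distance between the recognized and reference phoneme sequences, normalized by the reference length. A minimal sketch of the computation (the phone labels below are illustrative, not taken from the paper's experiments):

```python
def phoneme_error_rate(ref, hyp):
    """Edit distance (substitutions + insertions + deletions) between
    phoneme sequences, normalized by the reference length."""
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# TIMIT-style phone labels, one substitution out of five phones
ref = ["sh", "iy", "hh", "ae", "d"]
hyp = ["sh", "iy", "hh", "eh", "d"]
print(phoneme_error_rate(ref, hyp))  # 0.2
```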
Related papers
- Anonymising Elderly and Pathological Speech: Voice Conversion Using DDSP and Query-by-Example [4.42160195007899]
Speech anonymisation aims to protect speaker identity by changing personal identifiers in speech while retaining linguistic content.
Current methods fail to retain prosody and unique speech patterns found in elderly and pathological speech domains.
We propose a voice conversion-based method (DDSP-QbE) using differentiable digital signal processing and query-by-example.
arXiv Detail & Related papers (2024-10-20T20:40:56Z)
- Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition [13.029537136528521]
A potential drawback of model adaptation to new domains is catastrophic forgetting, where the Word Error Rate on the original domain is significantly degraded.
This paper addresses the situation where we want to adapt automatic speech recognition models to a new domain while preserving performance on the original domain.
We propose several techniques such as a limited training strategy and regularized adapter modules for the Transducer encoder, prediction, and joiner network.
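The summary does not specify the adapter architecture. A common residual adapter design (down-projection, nonlinearity, up-projection, residual connection) is sketched below with NumPy as an illustration of the general idea, not as that paper's implementation; all dimensions are assumptions:

```python
import numpy as np

class ResidualAdapter:
    """Bottleneck adapter: project down, apply ReLU, project back up,
    and add the input (residual). During domain adaptation only these
    small matrices are trained; the base network stays frozen."""
    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.w_down = rng.normal(0.0, scale, (dim, bottleneck))
        self.w_up = rng.normal(0.0, scale, (bottleneck, dim))

    def __call__(self, x):
        h = np.maximum(x @ self.w_down, 0.0)  # down-project + ReLU
        return x + h @ self.w_up              # up-project + residual

adapter = ResidualAdapter(dim=256, bottleneck=16)
frame = np.ones((1, 256))      # one encoder output frame
out = adapter(frame)
print(out.shape)               # (1, 256)
```

Because of the residual connection, zeroing the up-projection makes the adapter an identity map, which is why adapters can be inserted into a trained network without initially disturbing its behavior.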
arXiv Detail & Related papers (2022-10-06T23:38:50Z)
- Cross-domain Voice Activity Detection with Self-Supervised Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments on an audio signal.
Current state-of-the-art methods focus on training a neural network exploiting features directly contained in the acoustics.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
arXiv Detail & Related papers (2022-09-22T14:53:44Z)
- Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments [21.493664174262737]
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
arXiv Detail & Related papers (2022-07-15T03:43:35Z)
- Improving speaker de-identification with functional data analysis of f0 trajectories [10.809893662563926]
Formant modification is a simpler, yet effective method for speaker de-identification which requires no training data.
This study introduces a novel speaker de-identification method, which, in addition to simple formant shifts, manipulates f0 trajectories based on functional data analysis.
The proposed speaker de-identification method conceals plausibly identifying pitch characteristics in a phonetically controllable manner and improves formant-based speaker de-identification by up to 25%.
arXiv Detail & Related papers (2022-03-31T01:34:15Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of supervised speech processing models.
Enhancing the robustness of these models is essential for good performance when speech distortions are encountered.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Audio-Adaptive Activity Recognition Across Video Domains [112.46638682143065]
We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
arXiv Detail & Related papers (2022-03-27T08:15:20Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
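At its core, vector quantization for content encoding maps each frame embedding to its nearest codebook entry. A minimal NumPy sketch of that lookup step (codebook size and dimensions are illustrative, not from the paper):

```python
import numpy as np

def vector_quantize(frames, codebook):
    """Map each frame embedding to its nearest codebook vector by
    Euclidean distance. Returns code indices and quantized frames."""
    # Squared distances between every frame and every code: shape (T, K)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)        # nearest code per frame
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 3))    # K=4 codes of dimension 3
# Frames lying very close to codes 2, 0, and 2 respectively
frames = codebook[[2, 0, 2]] + 0.01 * rng.normal(size=(3, 3))
idx, quantized = vector_quantize(frames, codebook)
print(idx)  # [2 0 2]
```

The discrete bottleneck is what discards speaker-specific detail: every frame is replaced by one of a small number of shared code vectors, keeping only content-level information.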
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
arXiv Detail & Related papers (2020-12-12T19:46:56Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.