Cross-domain Adaptation with Discrepancy Minimization for
Text-independent Forensic Speaker Verification
- URL: http://arxiv.org/abs/2009.02444v2
- Date: Wed, 9 Sep 2020 16:53:21 GMT
- Title: Cross-domain Adaptation with Discrepancy Minimization for
Text-independent Forensic Speaker Verification
- Authors: Zhenyu Wang, Wei Xia, John H.L. Hansen
- Abstract summary: This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network on the VoxCeleb data, then fine-tune part of the high-level network layers with clean speech from CRSS-Forensics.
- Score: 61.54074498090374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Forensic audio analysis for speaker verification offers unique challenges due
to location/scenario uncertainty and diversity mismatch between reference and
naturalistic field recordings. The lack of real naturalistic forensic audio
corpora with ground-truth speaker identity represents a major challenge in this
field. It is also difficult to train complex neural network architectures
directly on small-scale domain-specific data, due to domain mismatch and the
resulting loss in performance. Cross-domain speaker verification for multiple
acoustic environments offers an alternative, but remains a challenging task that could advance research in
audio forensics. In this study, we introduce a CRSS-Forensics audio dataset
collected in multiple acoustic environments. We pre-train a CNN-based network
on the VoxCeleb data, then fine-tune part of the high-level network layers
with clean speech from CRSS-Forensics. Based on this
fine-tuned model, we align domain-specific distributions in the embedding space
with the discrepancy loss and maximum mean discrepancy (MMD). This maintains
effective performance on the clean set while simultaneously generalizing the
model to other acoustic domains. From the results, we demonstrate that diverse
acoustic environments affect the speaker verification performance, and that our
proposed approach of cross-domain adaptation can significantly improve the
results in this scenario.
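To make the MMD-based alignment step concrete, below is a minimal PyTorch sketch of a multi-kernel (Gaussian) MMD loss between clean-domain and field-domain embeddings. The function names, kernel bandwidths, and the `lambda_mmd` weighting are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def gaussian_kernel(x, y, sigmas=(1.0, 5.0, 10.0)):
    """Sum of RBF kernels at several bandwidths (multi-kernel MMD)."""
    d2 = torch.cdist(x, y) ** 2  # pairwise squared Euclidean distances
    return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

def mmd_loss(source_emb, target_emb):
    """Squared MMD between source (clean) and target (field) embeddings."""
    k_ss = gaussian_kernel(source_emb, source_emb).mean()
    k_tt = gaussian_kernel(target_emb, target_emb).mean()
    k_st = gaussian_kernel(source_emb, target_emb).mean()
    return k_ss + k_tt - 2.0 * k_st

# Illustrative training objective: speaker classification on clean data plus
# an MMD penalty that pulls the two embedding distributions together:
# total_loss = speaker_cls_loss + lambda_mmd * mmd_loss(clean_emb, field_emb)
```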
Related papers
- Audio-based Kinship Verification Using Age Domain Conversion [39.4890403254022]
A key challenge in this task arises from differences in age across samples from different individuals.
We utilise the optimised CycleGAN-VC3 network to perform age-audio conversion to generate the in-domain audio.
The generated audio dataset is employed to extract a range of features, which are then fed into a metric learning architecture to verify kinship.
arXiv Detail & Related papers (2024-10-14T22:08:57Z) - Multi-source Domain Adaptation for Text-independent Forensic Speaker
Recognition [36.83842373791537]
Adapting speaker recognition systems to new environments is a widely used technique to improve a well-performing model.
Previous studies focus on single-domain adaptation, which neglects the more practical scenario where training data are collected from multiple acoustic domains.
Three novel adaptation methods are proposed to further promote adaptation performance across multiple acoustic domains.
arXiv Detail & Related papers (2022-11-17T22:11:25Z) - Cross-domain Voice Activity Detection with Self-Supervised
Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments on an audio signal.
Current state-of-the-art methods focus on training a neural network on features extracted directly from the acoustic signal.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
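As a rough illustration of this idea, the sketch below puts a simple frame-level speech/non-speech head on top of a frozen self-supervised encoder (wav2vec 2.0 via torchaudio); the choice of encoder, feature dimension, and head are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torchaudio

# Frozen SSL encoder; a small linear head classifies each frame as speech/non-speech.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
encoder = bundle.get_model().eval()
vad_head = torch.nn.Linear(768, 2)  # 768 = wav2vec2-base feature dimension

waveform = torch.randn(1, bundle.sample_rate)  # one second of dummy audio
with torch.no_grad():
    features, _ = encoder.extract_features(waveform)  # list of (1, T, 768) layer outputs
logits = vad_head(features[-1])  # per-frame speech/non-speech logits
```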
arXiv Detail & Related papers (2022-09-22T14:53:44Z) - Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For
Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z) - AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach [50.855679274530615]
We present a novel domain-adaptive approach called AdaStereo to align multi-level representations for deep stereo matching networks.
Our models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo.
Our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
arXiv Detail & Related papers (2021-12-09T15:10:47Z) - DEAAN: Disentangled Embedding and Adversarial Adaptation Network for
Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
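Adversarial adaptation of this kind is often implemented with a gradient reversal layer, which trains a domain classifier while pushing the encoder toward domain-invariant embeddings; the following is a minimal PyTorch sketch under that assumption, not the exact DEAAN architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DomainClassifier(nn.Module):
    """Predicts the recording domain; the reversed gradients make the upstream
    speaker encoder produce domain-invariant embeddings."""
    def __init__(self, emb_dim=256, n_domains=2, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, n_domains))

    def forward(self, emb):
        return self.net(GradReverse.apply(emb, self.lamb))
```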
arXiv Detail & Related papers (2020-12-12T19:46:56Z) - Unsupervised Domain Adaptation for Acoustic Scene Classification Using
Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
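A rough NumPy sketch of this per-band alignment is shown below: each frequency band of a target-domain (log-mel) spectrogram is standardized and then rescaled to precomputed source-domain statistics. Shapes and variable names are illustrative assumptions.

```python
import numpy as np

def align_band_statistics(target_spec, src_mean, src_std, eps=1e-8):
    """Match per-band mean/std of a target-domain spectrogram to source stats.

    target_spec: (n_bands, n_frames) log-mel spectrogram from the target domain
    src_mean, src_std: (n_bands,) statistics computed on source training data
    """
    tgt_mean = target_spec.mean(axis=1, keepdims=True)
    tgt_std = target_spec.std(axis=1, keepdims=True)
    normalized = (target_spec - tgt_mean) / (tgt_std + eps)  # remove first/second-order stats
    return normalized * src_std[:, None] + src_mean[:, None]  # rescale to source stats
```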
arXiv Detail & Related papers (2020-04-30T23:56:05Z) - Toward Cross-Domain Speech Recognition with End-to-End Models [18.637636841477]
In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems.
We show that for the hybrid models, supplying additional training data from other domains with mismatched acoustic conditions does not increase the performance on specific domains.
Our end-to-end models optimized with a sequence-based criterion generalize better than the hybrid models across diverse domains.
arXiv Detail & Related papers (2020-03-09T15:19:53Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance under controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environments is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)