Cross-domain Adaptation with Discrepancy Minimization for
Text-independent Forensic Speaker Verification
- URL: http://arxiv.org/abs/2009.02444v2
- Date: Wed, 9 Sep 2020 16:53:21 GMT
- Title: Cross-domain Adaptation with Discrepancy Minimization for
Text-independent Forensic Speaker Verification
- Authors: Zhenyu Wang, Wei Xia, John H.L. Hansen
- Abstract summary: This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network on the VoxCeleb data, then fine-tune part of the high-level network layers with clean speech from CRSS-Forensics.
- Score: 61.54074498090374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Forensic audio analysis for speaker verification offers unique challenges due
to location/scenario uncertainty and diversity mismatch between reference and
naturalistic field recordings. The lack of real naturalistic forensic audio
corpora with ground-truth speaker identity represents a major challenge in this
field. It is also difficult to train complex neural network architectures
directly on small-scale domain-specific data, due to domain mismatch and the
resulting loss in performance. Cross-domain speaker verification for multiple
acoustic environments offers an alternative, but remains a challenging task that could advance research in
audio forensics. In this study, we introduce a CRSS-Forensics audio dataset
collected in multiple acoustic environments. We pre-train a CNN-based network
on the VoxCeleb data, then fine-tune part of the high-level network layers
with clean speech from CRSS-Forensics. Based on this
fine-tuned model, we align domain-specific distributions in the embedding space
with the discrepancy loss and maximum mean discrepancy (MMD). This maintains
effective performance on the clean set while simultaneously generalizing the
model to other acoustic domains. From the results, we demonstrate that diverse
acoustic environments affect the speaker verification performance, and that our
proposed approach of cross-domain adaptation can significantly improve the
results in this scenario.
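To make the MMD-based alignment step concrete, below is a minimal PyTorch sketch of a multi-kernel (Gaussian) MMD loss between clean-domain and field-domain embeddings. The function names, kernel bandwidths, and the `lambda_mmd` weighting are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def gaussian_kernel(x, y, sigmas=(1.0, 5.0, 10.0)):
    """Sum of RBF kernels at several bandwidths (multi-kernel MMD)."""
    d2 = torch.cdist(x, y) ** 2  # pairwise squared Euclidean distances
    return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

def mmd_loss(source_emb, target_emb):
    """Squared MMD between source (clean) and target (field) embeddings."""
    k_ss = gaussian_kernel(source_emb, source_emb).mean()
    k_tt = gaussian_kernel(target_emb, target_emb).mean()
    k_st = gaussian_kernel(source_emb, target_emb).mean()
    return k_ss + k_tt - 2.0 * k_st

# Illustrative training objective: speaker classification on clean data plus
# an MMD penalty that pulls the two embedding distributions together:
# total_loss = speaker_cls_loss + lambda_mmd * mmd_loss(clean_emb, field_emb)
```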
Related papers
- Audio-based Kinship Verification Using Age Domain Conversion [39.4890403254022]
A key challenge in this task arises from differences in age across samples from different individuals.
We utilise the optimised CycleGAN-VC3 network to perform age-audio conversion to generate the in-domain audio.
The generated audio dataset is employed to extract a range of features, which are then fed into a metric learning architecture to verify kinship.
arXiv Detail & Related papers (2024-10-14T22:08:57Z) - Multi-source Domain Adaptation for Text-independent Forensic Speaker
Recognition [36.83842373791537]
Adapting speaker recognition systems to new environments is a widely used technique to improve a well-performing model.
Previous studies focus on single-domain adaptation, which neglects the more practical scenario where training data are collected from multiple acoustic domains.
Three novel adaptation methods are proposed to further promote adaptation performance across multiple acoustic domains.
arXiv Detail & Related papers (2022-11-17T22:11:25Z) - Cross-domain Voice Activity Detection with Self-Supervised
Representations [9.02236667251654]
Voice Activity Detection (VAD) aims at detecting speech segments on an audio signal.
Current state-of-the-art methods focus on training a neural network on features extracted directly from the acoustic signal.
We show that representations based on Self-Supervised Learning (SSL) can adapt well to different domains.
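As a rough illustration of this idea, the sketch below puts a simple frame-level speech/non-speech head on top of a frozen self-supervised encoder (wav2vec 2.0 via torchaudio); the choice of encoder, feature dimension, and head are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torchaudio

# Frozen SSL encoder; a small linear head classifies each frame as speech/non-speech.
bundle = torchaudio.pipelines.WAV2VEC2_BASE
encoder = bundle.get_model().eval()
vad_head = torch.nn.Linear(768, 2)  # 768 = wav2vec2-base feature dimension

waveform = torch.randn(1, bundle.sample_rate)  # one second of dummy audio
with torch.no_grad():
    features, _ = encoder.extract_features(waveform)  # list of (1, T, 768) layer outputs
logits = vad_head(features[-1])  # per-frame speech/non-speech logits
```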
arXiv Detail & Related papers (2022-09-22T14:53:44Z) - Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For
Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z) - AdaStereo: An Efficient Domain-Adaptive Stereo Matching Approach [50.855679274530615]
We present a novel domain-adaptive approach called AdaStereo to align multi-level representations for deep stereo matching networks.
Our models achieve state-of-the-art cross-domain performance on multiple benchmarks, including KITTI, Middlebury, ETH3D and DrivingStereo.
Our method is robust to various domain adaptation settings, and can be easily integrated into quick adaptation application scenarios and real-world deployments.
arXiv Detail & Related papers (2021-12-09T15:10:47Z) - DEAAN: Disentangled Embedding and Adversarial Adaptation Network for
Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
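Adversarial adaptation of this kind is often implemented with a gradient reversal layer, which trains a domain classifier while pushing the encoder toward domain-invariant embeddings; the following is a minimal PyTorch sketch under that assumption, not the exact DEAAN architecture.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

class DomainClassifier(nn.Module):
    """Predicts the recording domain; the reversed gradients make the upstream
    speaker encoder produce domain-invariant embeddings."""
    def __init__(self, emb_dim=256, n_domains=2, lamb=1.0):
        super().__init__()
        self.lamb = lamb
        self.net = nn.Sequential(
            nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, n_domains))

    def forward(self, emb):
        return self.net(GradReverse.apply(emb, self.lamb))
```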
arXiv Detail & Related papers (2020-12-12T19:46:56Z) - Unsupervised Domain Adaptation for Acoustic Scene Classification Using
Band-Wise Statistics Matching [69.24460241328521]
Machine learning algorithms can be negatively affected by mismatches between training (source) and test (target) data distributions.
We propose an unsupervised domain adaptation method that consists of aligning the first- and second-order sample statistics of each frequency band of target-domain acoustic scenes to the ones of the source-domain training dataset.
We show that the proposed method outperforms the state-of-the-art unsupervised methods found in the literature in terms of both source- and target-domain classification accuracy.
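A rough NumPy sketch of this per-band alignment is shown below: each frequency band of a target-domain (log-mel) spectrogram is standardized and then rescaled to precomputed source-domain statistics. Shapes and variable names are illustrative assumptions.

```python
import numpy as np

def align_band_statistics(target_spec, src_mean, src_std, eps=1e-8):
    """Match per-band mean/std of a target-domain spectrogram to source stats.

    target_spec: (n_bands, n_frames) log-mel spectrogram from the target domain
    src_mean, src_std: (n_bands,) statistics computed on source training data
    """
    tgt_mean = target_spec.mean(axis=1, keepdims=True)
    tgt_std = target_spec.std(axis=1, keepdims=True)
    normalized = (target_spec - tgt_mean) / (tgt_std + eps)  # remove first/second-order stats
    return normalized * src_std[:, None] + src_mean[:, None]  # rescale to source stats
```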
arXiv Detail & Related papers (2020-04-30T23:56:05Z) - Toward Cross-Domain Speech Recognition with End-to-End Models [18.637636841477]
In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems.
We show that for the hybrid models, supplying additional training data from other domains with mismatched acoustic conditions does not increase the performance on specific domains.
Our end-to-end models optimized with a sequence-based criterion generalize better than the hybrid models across diverse domains.
arXiv Detail & Related papers (2020-03-09T15:19:53Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance under controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environments is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)