Phonetic-aware speaker embedding for far-field speaker verification
- URL: http://arxiv.org/abs/2311.15627v1
- Date: Mon, 27 Nov 2023 08:45:35 GMT
- Title: Phonetic-aware speaker embedding for far-field speaker verification
- Authors: Zezhong Jin, Youzhi Tu, Man-Wai Mak
- Abstract summary: We propose a joint-training speech recognition and speaker recognition framework to exploit phonetic content for far-field speaker verification.
The framework encourages speaker embeddings to preserve phonetic information by matching the frame-based feature maps of a speaker embedding network with wav2vec's vectors.
Results show that the proposed framework outperforms the standard speaker embedding on the VOiCES Challenge 2019 evaluation set and the VoxCeleb1 test set.
- Score: 25.50311094643337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When a speaker verification (SV) system operates far from the sound source,
significant challenges arise due to the interference of noise and
reverberation. Studies have shown that incorporating phonetic information into
speaker embedding can improve the performance of text-independent SV. Inspired
by this observation, we propose a joint-training speech recognition and speaker
recognition (JTSS) framework to exploit phonetic content for far-field SV. The
framework encourages speaker embeddings to preserve phonetic information by
matching the frame-based feature maps of a speaker embedding network with
wav2vec's vectors. The intuition is that phonetic information can preserve
low-level acoustic dynamics with speaker information and thus partly compensate
for the degradation due to noise and reverberation. Results show that the
proposed framework outperforms the standard speaker embedding on the VOiCES
Challenge 2019 evaluation set and the VoxCeleb1 test set. This indicates that
leveraging phonetic information under far-field conditions is effective for
learning robust speaker representations.
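As a rough illustration of the frame-level matching idea, the sketch below trains a speaker classifier while pulling its frame-based feature maps toward frozen wav2vec vectors. The TDNN-style layers, the MSE matching loss, and the 0.1 weight are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the JTSS idea: a speaker embedding network is trained
# with a speaker-classification loss plus a frame-level matching loss that
# pulls its feature maps toward wav2vec vectors. Layer sizes, the linear
# projection, and the use of MSE are assumptions for illustration only.

class SpeakerNet(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, emb_dim=192, n_spk=1000):
        super().__init__()
        self.frame_layers = nn.Sequential(           # frame-level feature maps
            nn.Conv1d(feat_dim, hidden, 5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)          # stand-in for stats pooling
        self.embed = nn.Linear(hidden, emb_dim)
        self.classify = nn.Linear(emb_dim, n_spk)
        self.match_proj = nn.Linear(hidden, 768)     # map to wav2vec dimension

    def forward(self, x):                            # x: (B, feat_dim, T)
        frames = self.frame_layers(x)                # (B, hidden, T)
        emb = self.embed(self.pool(frames).squeeze(-1))
        return frames, emb, self.classify(emb)

def jtss_loss(frames, logits, spk_labels, w2v_vecs, net, alpha=0.1):
    """Speaker CE loss + frame-level matching loss against frozen wav2vec vectors."""
    ce = F.cross_entropy(logits, spk_labels)
    proj = net.match_proj(frames.transpose(1, 2))    # (B, T, 768)
    match = F.mse_loss(proj, w2v_vecs)               # w2v_vecs: (B, T, 768), frozen
    return ce + alpha * match
```

In training, the wav2vec targets would come from a frozen pre-trained model (for example, torchaudio's WAV2VEC2_BASE bundle), time-aligned to the speaker network's frame rate.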
Related papers
- Improving Speaker Diarization using Semantic Information: Joint Pairwise
Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
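A hypothetical sketch of how such pairwise constraints could enter the clustering stage: must-link pairs raise segment affinity and cannot-link pairs suppress it before spectral clustering. The clamping scheme below is an illustrative stand-in for the paper's propagation algorithm.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Semantic evidence is expressed as must-link / cannot-link pairs between
# segments; the affinity matrix is adjusted before clustering. The hard
# 1.0/0.0 clamping is an assumption, not the paper's propagation rule.

def constrained_affinity(affinity, must_link, cannot_link):
    a = affinity.copy()
    for i, j in must_link:                 # semantic evidence: same speaker
        a[i, j] = a[j, i] = 1.0
    for i, j in cannot_link:               # semantic evidence: different speakers
        a[i, j] = a[j, i] = 0.0
    return a

affinity = np.random.rand(6, 6); affinity = (affinity + affinity.T) / 2
np.fill_diagonal(affinity, 1.0)
a = constrained_affinity(affinity, must_link=[(0, 1)], cannot_link=[(0, 5)])
labels = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(a)
```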
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Exploring Speaker-Related Information in Spoken Language Understanding
for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- PL-EESR: Perceptual Loss Based End-to-End Robust Speaker Representation
Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
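A minimal sketch of a perceptual-loss setup in this spirit: the enhancement front-end is trained so that a frozen speaker network responds to enhanced speech as it would to clean speech. The proxy networks, the layer choice, and the L1 distance are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The loss is computed in the feature space of a frozen speaker network
# rather than on the signal itself, so enhancement is steered toward
# whatever the speaker model finds useful. Both networks are toy proxies.

enhancer = nn.Sequential(nn.Conv1d(80, 256, 3, padding=1), nn.ReLU(),
                         nn.Conv1d(256, 80, 3, padding=1))
speaker_net = nn.Sequential(nn.Conv1d(80, 256, 3, padding=1), nn.ReLU())
for p in speaker_net.parameters():         # freeze the speaker branch
    p.requires_grad_(False)

def perceptual_loss(noisy, clean):
    enhanced = enhancer(noisy)
    return F.l1_loss(speaker_net(enhanced), speaker_net(clean))

loss = perceptual_loss(torch.randn(4, 80, 200), torch.randn(4, 80, 200))
loss.backward()                            # gradients flow into the enhancer only
```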
arXiv Detail & Related papers (2021-10-03T07:05:29Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning
for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
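A very rough sketch of the cooperative idea, with random stand-ins for both pre-trained models: acoustic representations are projected into the language model's hidden space so the two encoders train end to end. The fusion-by-projection is an assumption about the paper's mechanism.

```python
import torch
import torch.nn as nn

# Both encoders here are untrained stand-ins for wav2vec 2.0 and BERT;
# only the overall wiring (acoustic -> projection -> linguistic -> head)
# is intended to illustrate the end-to-end framework.

class WavBertSketch(nn.Module):
    def __init__(self, w2v_dim=768, bert_dim=768, vocab=5000):
        super().__init__()
        self.acoustic = nn.GRU(80, w2v_dim, batch_first=True)      # wav2vec stand-in
        self.proj = nn.Linear(w2v_dim, bert_dim)
        self.linguistic = nn.TransformerEncoder(                    # BERT stand-in
            nn.TransformerEncoderLayer(bert_dim, 8, batch_first=True), 2)
        self.head = nn.Linear(bert_dim, vocab)                      # token head

    def forward(self, feats):                                       # (B, T, 80)
        acoustic, _ = self.acoustic(feats)
        return self.head(self.linguistic(self.proj(acoustic)))

logits = WavBertSketch()(torch.randn(2, 100, 80))                   # (2, 100, 5000)
```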
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Phoneme-aware and Channel-wise Attentive Learning for Text-Dependent
Speaker Verification [21.826585075806573]
This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent speaker verification (SV).
The proposed system achieves outstanding results for text-dependent SV.
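A sketch of channel-wise attentive learning as a squeeze-and-excitation style gate over frame-level feature maps; the phoneme-aware part would add a phoneme-classification head on the same features in a multi-task setup. The SE formulation is an assumption, not necessarily the paper's design.

```python
import torch
import torch.nn as nn

# Channels of a frame-level feature map are re-weighted by a gate computed
# from their global statistics (squeeze over time, then excite).

class ChannelAttention(nn.Module):
    def __init__(self, channels=256, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, T)
        w = self.gate(x.mean(dim=-1))           # squeeze over time -> (B, C)
        return x * w.unsqueeze(-1)              # excite: re-weight channels

out = ChannelAttention()(torch.randn(4, 256, 120))
```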
arXiv Detail & Related papers (2021-06-25T09:11:18Z)
- Content-Aware Speaker Embeddings for Speaker Diarisation [3.6398652091809987]
The content-aware speaker embeddings (CASE) approach is proposed.
CASE factorises automatic speech recognition (ASR) from speaker recognition to focus on modelling speaker characteristics.
CASE achieved a 17.8% relative speaker error rate reduction over conventional methods.
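One hypothetical reading of the content-aware idea: frame-level ASR content features (for example, phone posteriors) are fed alongside acoustic features so the speaker branch can discount phonetic variation. Concatenation before pooling is an illustrative choice only.

```python
import torch
import torch.nn as nn

# Acoustic features and ASR-derived content features are concatenated on
# the channel axis before the speaker encoder; all sizes are illustrative.

class ContentAwareEmbedder(nn.Module):
    def __init__(self, acoustic_dim=80, content_dim=40, emb_dim=192):
        super().__init__()
        self.encoder = nn.Conv1d(acoustic_dim + content_dim, 256, 3, padding=1)
        self.embed = nn.Linear(256, emb_dim)

    def forward(self, acoustic, content):          # both (B, dim, T)
        h = torch.relu(self.encoder(torch.cat([acoustic, content], dim=1)))
        return self.embed(h.mean(dim=-1))          # average pooling over time

emb = ContentAwareEmbedder()(torch.randn(2, 80, 150), torch.randn(2, 40, 150))
```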
arXiv Detail & Related papers (2021-02-12T12:02:03Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and
Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
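A sketch of FragmentVC's core operation: source-utterance content vectors (from Wav2Vec 2.0 in the paper; random tensors here) attend over target-speaker mel fragments through cross-attention, and a decoder maps the fused result back to mel frames for the reconstruction loss. All dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Queries come from the source utterance's content representation; keys and
# values come from the target speaker's mel fragments. Training would use a
# plain reconstruction loss on mel_out, with no explicit disentanglement.

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
to_q = nn.Linear(768, 256)      # project wav2vec-like content features
to_kv = nn.Linear(80, 256)      # project target mel fragments
decode = nn.Linear(256, 80)     # back to mel

src_content = torch.randn(1, 120, 768)   # source utterance, wav2vec-like
tgt_mels = torch.randn(1, 300, 80)       # target speaker fragments

q, kv = to_q(src_content), to_kv(tgt_mels)
fused, _ = attn(q, kv, kv)               # fuse target voice into source content
mel_out = decode(fused)                  # (1, 120, 80), reconstruction target
```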
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
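A loose sketch of the context idea: short-term features for several candidate speakers are stacked over a long window and related through self-attention before per-speaker scoring. The feature sizes and the single attention layer are assumptions.

```python
import torch
import torch.nn as nn

# Short-term audiovisual features for 4 candidate speakers over 32 time
# steps are flattened into one sequence so that attention can relate every
# speaker to every other across the long horizon.

layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
score = nn.Linear(128, 1)

clips = torch.randn(2, 4 * 32, 128)      # (batch, speakers * timesteps, feat)
context = layer(clips)                   # every speaker attends to all others
speaking_logits = score(context)         # activity score per speaker-timestep
```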
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
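An illustrative sketch of blending word-level turn probabilities into an embedding-based affinity before clustering: a likely turn between adjacent segments lowers their same-speaker affinity. The linear blend and its weight are assumptions, not the paper's integration rule.

```python
import numpy as np

# ASR-derived speaker-turn probabilities modulate the affinity between
# adjacent segments; the resulting matrix would feed a clustering step.

def blend_affinity(embedding_affinity, turn_prob, weight=0.5):
    a = embedding_affinity.copy()
    n = a.shape[0]
    for i in range(n - 1):                       # adjacent segments only
        same_speaker = 1.0 - turn_prob[i]        # high turn prob -> low affinity
        a[i, i + 1] = a[i + 1, i] = (1 - weight) * a[i, i + 1] + weight * same_speaker
    return a

aff = np.full((4, 4), 0.8); np.fill_diagonal(aff, 1.0)
blended = blend_affinity(aff, turn_prob=np.array([0.1, 0.9, 0.2]))
```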
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight the speaker related features learned from context information in time and frequency domain.
The obtained results show that the proposed approach using speech enhancement and multi-stage attention models outperforms two strong baselines not using them in most acoustic conditions in our experiments.
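A sketch of what a multi-stage attention over time and frequency could look like: one gate weights frames, a second weights frequency bins, applied in sequence to features shared by the enhancement and speaker branches. The gating form is an assumption.

```python
import torch
import torch.nn as nn

# Stage 1 weights each time frame; stage 2 weights each frequency bin from
# time-averaged statistics. Both gates are illustrative choices only.

class TimeFreqAttention(nn.Module):
    def __init__(self, freq_bins=80):
        super().__init__()
        self.time_gate = nn.Sequential(nn.Linear(freq_bins, 1), nn.Sigmoid())
        self.freq_gate = nn.Sequential(nn.Linear(freq_bins, freq_bins), nn.Sigmoid())

    def forward(self, x):                                    # x: (B, T, F)
        x = x * self.time_gate(x)                            # weight each frame
        x = x * self.freq_gate(x.mean(dim=1, keepdim=True))  # weight each bin
        return x

out = TimeFreqAttention()(torch.randn(2, 200, 80))
```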
arXiv Detail & Related papers (2020-01-14T20:03:07Z)