Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments
- URL: http://arxiv.org/abs/2410.05423v1
- Date: Mon, 7 Oct 2024 18:39:59 GMT
- Title: Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments
- Authors: Sagarika Alavilli, Annesya Banerjee, Gasser Elbanna, Annika Magaro
- Abstract summary: We develop a transformer-based model that jointly performs speech recognition and speaker identification.
We show that the joint model performs comparably to Whisper under clean conditions.
Our results suggest that integrating voice representations with speech recognition can lead to more robust models under adversarial conditions.
- Score: 0.2916558661202724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current state-of-the-art speech recognition models are trained to map acoustic signals into sub-lexical units. While these models demonstrate superior performance, they remain vulnerable to out-of-distribution conditions such as background noise and speech augmentations. In this work, we hypothesize that incorporating speaker representations during speech recognition can enhance model robustness to noise. We developed a transformer-based model that jointly performs speech recognition and speaker identification. Our model utilizes speech embeddings from Whisper and speaker embeddings from ECAPA-TDNN, which are processed jointly to perform both tasks. We show that the joint model performs comparably to Whisper under clean conditions. Notably, the joint model outperforms Whisper in high-noise environments, such as with 8-speaker babble background noise. Furthermore, our joint model excels in handling highly augmented speech, including sine-wave and noise-vocoded speech. Overall, these results suggest that integrating voice representations with speech recognition can lead to more robust models under adversarial conditions.
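The abstract specifies the architecture only at a high level: Whisper speech embeddings and an ECAPA-TDNN speaker embedding are "processed jointly" for both tasks. Below is a minimal PyTorch sketch of one plausible arrangement; the fusion strategy (prepending the speaker embedding as an extra token), the embedding dimensions, and all module sizes are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class JointASRSpeakerModel(nn.Module):
    """Toy joint model: fuses a frame-level speech-embedding sequence
    with an utterance-level speaker embedding, then reads out both
    tasks from the shared stream. Illustrative only."""

    def __init__(self, speech_dim=512, speaker_dim=192, d_model=256,
                 vocab_size=1000, num_speakers=100):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, d_model)
        self.speaker_proj = nn.Linear(speaker_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.asr_head = nn.Linear(d_model, vocab_size)    # per-frame sub-lexical logits
        self.spk_head = nn.Linear(d_model, num_speakers)  # utterance-level speaker logits

    def forward(self, speech_emb, speaker_emb):
        # speech_emb: (B, T, speech_dim), e.g. Whisper encoder states
        # speaker_emb: (B, speaker_dim), e.g. an ECAPA-TDNN embedding
        spk_token = self.speaker_proj(speaker_emb).unsqueeze(1)  # (B, 1, d)
        frames = self.speech_proj(speech_emb)                    # (B, T, d)
        fused = self.fusion(torch.cat([spk_token, frames], dim=1))
        asr_logits = self.asr_head(fused[:, 1:, :])  # token logits per frame
        spk_logits = self.spk_head(fused[:, 0, :])   # read from the speaker slot
        return asr_logits, spk_logits

# Smoke test with random stand-ins for the two embedding streams.
model = JointASRSpeakerModel()
asr_logits, spk_logits = model(torch.randn(2, 100, 512), torch.randn(2, 192))
print(asr_logits.shape, spk_logits.shape)  # (2, 100, 1000) and (2, 100)
```

Prepending the speaker embedding as a token lets self-attention condition every frame on talker identity; cross-attention or feature concatenation would serve the same illustrative purpose.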
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- A Closer Look at Wav2Vec2 Embeddings for On-Device Single-Channel Speech Enhancement [16.900731393703648]
Self-supervised learning (SSL) models have been found to be very effective for certain speech tasks.
In this paper, we investigate the use of SSL representations for single-channel speech enhancement in challenging conditions.
arXiv Detail & Related papers (2024-03-03T02:05:17Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding 9.56% and 8.24% average reductions in EER and minDCF, respectively (the EER metric is sketched in the example after this list).
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
- Pre-trained Model Representations and their Robustness against Noise for Speech Emotion Analysis [6.382013662443799]
We use multi-modal fusion representations from pre-trained models to achieve state-of-the-art speech emotion estimation.
We find that lexical representations are more robust to distortions than acoustic representations.
arXiv Detail & Related papers (2023-03-03T18:22:32Z)
- Fine-grained Noise Control for Multispeaker Speech Synthesis [3.449700218265025]
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.
Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors.
arXiv Detail & Related papers (2022-04-11T13:13:55Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train the model to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- Audio ALBERT: A Lite BERT for Self-supervised Learning of Audio Representation [51.37980448183019]
We propose Audio ALBERT, a lite version of the self-supervised speech representation model.
We show that Audio ALBERT achieves performance competitive with much larger self-supervised models on downstream tasks.
In probing experiments, we find that the intermediate latent representations encode richer phoneme and speaker information than those of the last layer.
arXiv Detail & Related papers (2020-05-18T10:42:44Z)
- Speaker Re-identification with Speaker Dependent Speech Enhancement [37.33388614967888]
This paper introduces a novel approach that cascades speech enhancement and speaker recognition.
The proposed approach is evaluated on the VoxCeleb1 dataset, which is designed to assess speaker recognition in real-world situations.
arXiv Detail & Related papers (2020-05-15T23:02:10Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of processing speech enhancement and speaker recognition separately, the two modules are integrated into one framework through joint optimisation with deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism highlights speaker-related features learned from contextual information in the time and frequency domains.
The results show that the proposed approach, combining speech enhancement with multi-stage attention, outperforms two strong baselines without these components under most acoustic conditions in our experiments (a schematic joint objective is sketched after this list).
arXiv Detail & Related papers (2020-01-14T20:03:07Z)
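Several entries above report equal error rate (EER), e.g. the disentanglement paper. For reference, here is a minimal NumPy sketch of how EER is typically computed from speaker-verification trial scores; the toy scores and labels are invented for illustration.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the operating point where the false-acceptance rate (FAR)
    equals the false-rejection rate (FRR). scores: higher = more likely
    same speaker; labels: 1 = target trial, 0 = impostor trial."""
    thresholds = np.sort(np.unique(scores))
    far = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])
    frr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])
    idx = np.argmin(np.abs(far - frr))  # closest FAR/FRR crossing
    return (far[idx] + frr[idx]) / 2, thresholds[idx]

# Toy trial list: five target and five impostor scores.
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
eer, thr = equal_error_rate(scores, labels)
print(f"EER = {eer:.2%} at threshold {thr:.2f}")
```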
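The last entry integrates enhancement and speaker recognition through joint optimisation. Below is a schematic PyTorch sketch of such a joint objective; the mask-based enhancer, the mean pooling, and the 0.5 loss weight are assumptions, and the paper's multi-stage attention is reduced here to a single learned time-frequency mask.

```python
import torch
import torch.nn as nn

class JointEnhanceRecognise(nn.Module):
    """Enhancement front-end and speaker classifier trained together,
    so speaker-identification gradients also shape the enhancer."""

    def __init__(self, feat_dim=80, num_speakers=100):
        super().__init__()
        self.enhancer = nn.Sequential(   # predicts a time-frequency mask in [0, 1]
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim), nn.Sigmoid())
        self.classifier = nn.Sequential( # utterance-level speaker logits
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, num_speakers))

    def forward(self, noisy):                           # noisy: (B, T, feat_dim)
        enhanced = noisy * self.enhancer(noisy)         # masked features
        logits = self.classifier(enhanced.mean(dim=1))  # pool over time
        return enhanced, logits

model = JointEnhanceRecognise()
noisy, clean = torch.randn(4, 120, 80), torch.randn(4, 120, 80)
speaker_ids = torch.randint(0, 100, (4,))
enhanced, logits = model(noisy)
loss = (nn.functional.mse_loss(enhanced, clean)                     # enhancement term
        + 0.5 * nn.functional.cross_entropy(logits, speaker_ids))  # speaker term
loss.backward()  # one joint step updates both modules
```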
This list is automatically generated from the titles and abstracts of the papers in this site.