Speaker and Language Change Detection using Wav2vec2 and Whisper
- URL: http://arxiv.org/abs/2302.09381v1
- Date: Sat, 18 Feb 2023 16:45:30 GMT
- Title: Speaker and Language Change Detection using Wav2vec2 and Whisper
- Authors: Tijn Berns, Nik Vaessen and David A. van Leeuwen
- Abstract summary: We investigate transformer networks pre-trained for automatic speech recognition for their ability to detect speaker and language changes in speech.
We show that these capabilities are definitely there, with speaker recognition equal error rates of the order of 10% and language detection error rates of a few percent.
- Score: 1.9594639581421422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate recent transformer networks pre-trained for automatic speech
recognition for their ability to detect speaker and language changes in speech.
We do this by simply adding speaker (change) or language targets to the labels.
For Wav2vec2 pre-trained networks, we also investigate if the representation
for the speaker change symbol can be conditioned to capture speaker identity
characteristics. Using a number of constructed data sets we show that these
capabilities are definitely there, with speaker recognition equal error rates
of the order of 10% and language detection error rates of a few percent. We
will publish the code for reproducibility.
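The idea of "adding speaker (change) or language targets to the labels" can be illustrated with a minimal sketch. This is not the authors' code: the symbol `<sc>`, the `<en>`-style language tags, and both function names are hypothetical choices made for illustration, assuming each training example is a time-ordered list of transcribed segments.

```python
from typing import List, Tuple

# Hypothetical speaker-change symbol; the paper's actual token is not stated here.
SPEAKER_CHANGE = "<sc>"


def build_speaker_change_target(
    segments: List[Tuple[str, str]], change_symbol: str = SPEAKER_CHANGE
) -> str:
    """Join (speaker_id, transcript) segments into one label string,
    inserting change_symbol wherever the speaker differs from the previous one."""
    pieces: List[str] = []
    previous_speaker = None
    for speaker, text in segments:
        if previous_speaker is not None and speaker != previous_speaker:
            pieces.append(change_symbol)
        pieces.append(text.strip())
        previous_speaker = speaker
    return " ".join(pieces)


def build_language_target(segments: List[Tuple[str, str]]) -> str:
    """Join (language_code, transcript) segments into one label string,
    prefixing each segment with a hypothetical language tag such as <en>."""
    return " ".join(f"<{lang}> {text.strip()}" for lang, text in segments)


if __name__ == "__main__":
    dialogue = [
        ("spk_a", "hello there"),
        ("spk_b", "hi"),
        ("spk_b", "how are you"),
        ("spk_a", "fine"),
    ]
    print(build_speaker_change_target(dialogue))
    print(build_language_target([("en", "good morning"), ("nl", "goedemorgen")]))
```

A speech recognition model fine-tuned on such label strings would have to emit the extra symbol at the right position, which is what lets change detection be scored alongside transcription.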
Related papers
- Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation [45.51065693072839]
We propose to tackle streaming speaker change detection and gender classification by incorporating speaker embeddings into a transducer-based streaming end-to-end speech translation model.
Our experiments demonstrate that the proposed methods can achieve high accuracy for both speaker change detection and gender classification.
arXiv Detail & Related papers (2025-02-04T19:50:15Z)
- ExPO: Explainable Phonetic Trait-Oriented Network for Speaker Verification [48.98768967435808]
We use a computational method to verify whether an utterance matches the identity of an enrolled speaker.
Despite much success, we have yet to develop a speaker verification system that offers explainable results.
A novel approach, Explainable Phonetic Trait-Oriented (ExPO) network, is proposed in this paper to introduce the speaker's phonetic trait.
arXiv Detail & Related papers (2025-01-10T05:53:37Z)
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
arXiv Detail & Related papers (2024-07-16T18:03:58Z)
- Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching [65.74653592668743]
Finetuning self-supervised multilingual representations reduces absolute word error rates by up to 20%.
When training data is limited, finetuning self-supervised representations is a viable and better-performing solution.
arXiv Detail & Related papers (2023-11-25T17:05:21Z)
- Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z)
- Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE [8.144263449781967]
The variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings.
In this work, we identify a suitable location in the VAE's decoder to add a self-attention layer that incorporates non-local information when generating a converted utterance.
arXiv Detail & Related papers (2022-03-30T03:52:42Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Towards an Efficient Voice Identification Using Wav2Vec2.0 and HuBERT Based on the Quran Reciters Dataset [0.0]
We develop a deep learning model for Arabic speaker identification using the Wav2Vec2.0 and HuBERT audio representation learning tools.
The experiments show that an arbitrary wave signal from a given speaker can be identified with accuracies of 98% and 97.1%.
arXiv Detail & Related papers (2021-11-11T17:44:50Z)
- U-vectors: Generating clusterable speaker embedding from unlabeled data [0.0]
This paper introduces a speaker recognition strategy dealing with unlabeled data.
It generates clusterable embedding vectors from small fixed-size speech frames.
We conclude that the proposed approach achieves remarkable performance using pairwise architectures.
arXiv Detail & Related papers (2021-02-07T18:00:09Z)
- Leveraging speaker attribute information using multi task learning for speaker verification and diarization [33.60058873783114]
We propose a framework for making use of auxiliary label information, even when it is only available for speech corpora mismatched to the target application.
We show that by leveraging two additional forms of speaker attribute information, we improve the performance of our deep speaker embeddings for both verification and diarization tasks.
arXiv Detail & Related papers (2020-10-27T13:10:51Z)