Speaker-Utterance Dual Attention for Speaker and Utterance Verification
- URL: http://arxiv.org/abs/2008.08901v1
- Date: Thu, 20 Aug 2020 11:37:57 GMT
- Title: Speaker-Utterance Dual Attention for Speaker and Utterance Verification
- Authors: Tianchi Liu, Rohan Kumar Das, Maulik Madhavi, Shengmei Shen, Haizhou Li
- Abstract summary: We implement an idea of speaker-utterance dual attention (SUDA) in a unified neural network.
The proposed SUDA features an attention mask mechanism to learn the interaction between the speaker and utterance information streams.
- Score: 77.2346078109261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study a novel technique that exploits the interaction
between speaker traits and linguistic content to improve both speaker
verification and utterance verification performance. We implement an idea of
speaker-utterance dual attention (SUDA) in a unified neural network. The dual
attention refers to an attention mechanism for the two tasks of speaker and
utterance verification. The proposed SUDA features an attention mask mechanism
to learn the interaction between the speaker and utterance information streams.
This helps each task focus only on the information it needs by masking the
irrelevant counterparts. Studies conducted on the RSR2015 corpus confirm that
the proposed SUDA outperforms both the same framework without the attention
mask and several competitive systems, for speaker and utterance verification
alike.
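The abstract describes the mask mechanism only at a high level. Below is a minimal PyTorch sketch of one way a dual attention mask between speaker and utterance streams could be realized; the class name `DualAttentionMask`, the embedding size, and the sigmoid gating are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class DualAttentionMask(nn.Module):
    """Hypothetical cross-stream attention mask: each stream is gated by
    weights computed from the other, so the speaker branch can suppress
    content information and the utterance branch can suppress speaker traits."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # One sigmoid mask generator per stream, conditioned on the opposite stream.
        self.spk_mask = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.utt_mask = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, spk: torch.Tensor, utt: torch.Tensor):
        spk_out = spk * self.spk_mask(utt)  # speaker stream, masked via content
        utt_out = utt * self.utt_mask(spk)  # utterance stream, masked via speaker
        return spk_out, utt_out

# Toy usage: (batch, dim) embeddings from a shared front-end (assumed).
spk_emb, utt_emb = torch.randn(8, 256), torch.randn(8, 256)
masked_spk, masked_utt = DualAttentionMask(256)(spk_emb, utt_emb)
```

Each masked embedding would then feed its own verification head; the exact conditioning SUDA uses is not specified in the abstract.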
Related papers
- Investigation of Speaker Representation for Target-Speaker Speech Processing [49.110228525976794]
This paper aims to address a fundamental question: what is the preferred speaker embedding for target-speaker speech processing tasks?
For the TS-ASR, TSE, and p-VAD tasks, we compare pre-trained speaker encoders that compute speaker embeddings from pre-recorded enrollment speech of the target speaker with ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector.
Our analysis reveals that speaker verification performance is somewhat unrelated to target-speaker task performance, that the one-hot vector outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
arXiv Detail & Related papers (2024-10-15T03:58:13Z)
- Improving Speaker Diarization using Semantic Information: Joint Pairwise Constraints Propagation [53.01238689626378]
We propose a novel approach to leverage semantic information in speaker diarization systems.
We introduce spoken language understanding modules to extract speaker-related semantic information.
We present a novel framework to integrate these constraints into the speaker diarization pipeline.
arXiv Detail & Related papers (2023-09-19T09:13:30Z)
- Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization [7.673971221635779]
We propose methods to extract speaker-related information from semantic content in multi-party meetings.
Experiments on both AISHELL-4 and AliMeeting datasets show that our method achieves consistent improvements over acoustic-only speaker diarization systems.
arXiv Detail & Related papers (2023-05-22T11:14:19Z)
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
Second, embedding extractors see little overlapped or speaker-changing speech during training; we propose two data augmentation techniques to alleviate this, making embedding extractors aware of overlapped speech or speaker change input.
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection [9.914246432182873]
In noisy conditions, automatic speech recognition can benefit from the addition of visual signals coming from a video of the speaker's face.
Active speaker detection involves selecting at each moment in time which of the visible faces corresponds to the audio.
Recent work has shown that we can solve both problems simultaneously by employing an attention mechanism over the competing video tracks of the speakers' faces.
This work closes the active speaker detection accuracy gap of such joint models by presenting a single model that can be jointly trained with a multi-task loss; a rough sketch of the track-attention idea follows this entry.
arXiv Detail & Related papers (2022-05-10T23:03:19Z)
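The summary above mentions attention over competing face tracks. A rough PyTorch sketch of that idea; the class name `TrackAttention`, the bilinear scorer, and all dimensions are assumptions for illustration, not the paper's model.

```python
import torch
import torch.nn as nn

class TrackAttention(nn.Module):
    """Illustrative attention over N candidate face tracks: the attention
    weights act as an active-speaker prediction, while the attention-pooled
    visual feature can feed the speech recognizer (assumed design)."""

    def __init__(self, audio_dim: int = 256, video_dim: int = 256):
        super().__init__()
        self.score = nn.Bilinear(audio_dim, video_dim, 1)

    def forward(self, audio: torch.Tensor, tracks: torch.Tensor):
        # audio: (B, audio_dim); tracks: (B, N, video_dim)
        n = tracks.size(1)
        audio_rep = audio.unsqueeze(1).expand(-1, n, -1).contiguous()
        attn = self.score(audio_rep, tracks).squeeze(-1).softmax(dim=-1)
        pooled = (attn.unsqueeze(-1) * tracks).sum(dim=1)
        return attn, pooled  # attn -> ASD prediction, pooled -> ASR features

# Toy usage with 3 competing face tracks; a joint objective would combine
# an ASR term and an ASD term (the weighting is a modeling choice).
attn, pooled = TrackAttention()(torch.randn(2, 256), torch.randn(2, 3, 256))
```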
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross modality learning.
It achieves between 1.7 and 2.3 BLEU improvement above the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z)
- A Machine of Few Words -- Interactive Speaker Recognition with Reinforcement Learning [35.36769027019856]
We present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR).
In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances.
We show that our method achieves excellent performance while using only small amounts of speech; a toy sketch of the interaction loop follows this entry.
arXiv Detail & Related papers (2020-08-07T12:44:08Z)
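The ISR paradigm above builds a speaker representation turn by turn. A toy sketch of that interaction loop; the enquirer, speaker, and guesser interfaces are entirely assumed, not taken from the paper.

```python
import torch

def interactive_recognition(enquirer, speaker, guesser, vocab, max_turns=3):
    """Hypothetical ISR loop: an enquirer (the RL policy) requests words,
    the speaker utters them, and a guesser identifies the speaker from the
    accumulated utterance embeddings."""
    gathered = []
    for _ in range(max_turns):
        word = enquirer(gathered, vocab)   # policy picks the next request
        gathered.append(speaker(word))     # speaker's response embedding
    return guesser(torch.stack(gathered))  # posterior over enrolled speakers

# Toy stand-ins so the sketch runs end to end (all fake).
vocab = ["zero", "one", "two"]
enquirer = lambda state, vocab: vocab[len(state) % len(vocab)]
speaker = lambda word: torch.randn(16)               # fake utterance embedding
guesser = lambda embs: embs.mean(dim=0).softmax(-1)  # fake speaker posterior
print(interactive_recognition(enquirer, speaker, guesser, vocab))
```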
- Active Speakers in Context [88.22935329360618]
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker.
This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons.
Our experiments show that a structured feature ensemble already benefits the active speaker detection performance.
arXiv Detail & Related papers (2020-05-20T01:14:23Z)
- Multi-Task Learning with Auxiliary Speaker Identification for Conversational Emotion Recognition [32.439818455554885]
We exploit speaker identification (SI) as an auxiliary task to enhance the utterance representation in conversations.
By this method, we can learn better speaker-aware contextual representations from the additional SI corpus.
Experiments on two benchmark datasets demonstrate that the proposed architecture is highly effective for CER; a minimal sketch of such a multi-task setup follows below.
arXiv Detail & Related papers (2020-03-03T12:25:03Z)
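A minimal sketch of the multi-task setup described above, assuming a shared utterance encoder with an emotion head and an auxiliary speaker-identification head; the layer sizes, class counts, and the 0.1 auxiliary loss weight are illustrative, not the paper's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskCER(nn.Module):
    """Shared encoder with two heads: conversational emotion recognition
    (main task) and speaker identification (auxiliary task)."""

    def __init__(self, dim=256, n_emotions=7, n_speakers=100):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)  # stand-in for the real encoder
        self.emotion_head = nn.Linear(dim, n_emotions)
        self.speaker_head = nn.Linear(dim, n_speakers)

    def forward(self, x):
        h = torch.relu(self.encoder(x))  # speaker-aware representation
        return self.emotion_head(h), self.speaker_head(h)

# Joint objective: emotion loss plus a down-weighted auxiliary SI loss
# (the 0.1 weight is an assumption, not taken from the paper).
model = MultiTaskCER()
x = torch.randn(4, 256)
emo_logits, spk_logits = model(x)
loss = F.cross_entropy(emo_logits, torch.randint(0, 7, (4,))) \
     + 0.1 * F.cross_entropy(spk_logits, torch.randint(0, 100, (4,)))
```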
This list is automatically generated from the titles and abstracts of the papers on this site.