Encoder-decoder multimodal speaker change detection
- URL: http://arxiv.org/abs/2306.00680v1
- Date: Thu, 1 Jun 2023 13:55:23 GMT
- Title: Encoder-decoder multimodal speaker change detection
- Authors: Jee-weon Jung, Soonshin Seo, Hee-Soo Heo, Geonmin Kim, You Jin Kim,
Young-ki Kwon, Minjae Lee, Bong-Jin Lee
- Abstract summary: Speaker change detection (SCD) is essential for several applications.
Multimodal SCD (MMSCD) models, which utilise the text modality in addition to audio, have shown improved performance.
This study builds upon two main proposals: a novel mechanism for modality fusion and the adoption of an encoder-decoder architecture.
- Score: 15.290910973040152
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of speaker change detection (SCD), which detects points where
speakers change in an input, is essential for several applications. Several
studies have addressed the SCD task using audio inputs only and shown limited
performance. Recently, multimodal SCD (MMSCD) models, which utilise the text
modality in addition to audio, have shown improved performance. In this study,
the proposed model is built upon two main proposals: a novel mechanism for
modality fusion and the adoption of an encoder-decoder architecture. Unlike
previous MMSCD works that extract speaker embeddings from extremely short
audio segments aligned to a single word, we use a speaker embedding extracted
from 1.5 s of audio. A transformer decoder layer further improves the performance of an
encoder-only MMSCD model. The proposed model achieves state-of-the-art results
among studies that report SCD performance and is also on par with recent work
that combines SCD with automatic speech recognition via human transcription.
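As a concrete illustration of the two proposals above (modality fusion and an encoder-decoder head), here is a minimal PyTorch sketch; the dimensions, fusion rule, and prediction head are illustrative assumptions, not the paper's exact architecture:

```python
# Minimal sketch of an encoder-decoder multimodal SCD model. Hypothetical
# dimensions and fusion details; the paper's exact design may differ.
import torch
import torch.nn as nn

class MMSCD(nn.Module):
    def __init__(self, text_dim=768, spk_dim=256, d_model=256, n_labels=2):
        super().__init__()
        # Fuse each word embedding with a speaker embedding taken from a
        # 1.5 s audio window around that word.
        self.fuse = nn.Linear(text_dim + spk_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # A single transformer decoder layer refines the encoder output.
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.head = nn.Linear(d_model, n_labels)  # change / no-change per word

    def forward(self, word_emb, spk_emb):
        # word_emb: (B, T, text_dim), spk_emb: (B, T, spk_dim)
        x = self.fuse(torch.cat([word_emb, spk_emb], dim=-1))
        memory = self.encoder(x)
        y = self.decoder(x, memory)
        return self.head(y)

model = MMSCD()
logits = model(torch.randn(2, 10, 768), torch.randn(2, 10, 256))
print(logits.shape)  # torch.Size([2, 10, 2])
```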
Related papers
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers [0.0]
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.
Results show that our tailored AVSR system reaches state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses limitations of SIMO models by aggregating cross-speaker representations.
The network is integrated with serialized output training (SOT) to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Robust Wake-Up Word Detection by Two-stage Multi-resolution Ensembles [48.208214762257136]
It employs two models: a lightweight on-device model for real-time processing of the audio stream and a verification model on the server-side.
To protect privacy, audio features are sent to the cloud instead of raw audio.
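A hypothetical sketch of the two-stage flow summarised above: a cheap on-device detector gates a heavier server-side verifier, and only derived features (never raw audio) leave the device. All function names and thresholds are illustrative:

```python
# Two-stage wake-word detection sketch (stand-in models, assumed threshold).
import numpy as np

def on_device_score(features: np.ndarray) -> float:
    # Stand-in for a lightweight streaming model running on the device.
    return float(features.mean())

def server_verify(features: np.ndarray) -> bool:
    # Stand-in for the larger verification model running server-side.
    return features.std() > 0.5

def process_frame(features: np.ndarray, trigger_threshold: float = 0.6) -> bool:
    if on_device_score(features) < trigger_threshold:
        return False                 # stage 1: cheap rejection on device
    return server_verify(features)   # stage 2: send features, not raw audio

print(process_frame(np.random.rand(40, 64)))
```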
arXiv Detail & Related papers (2023-10-17T16:22:18Z) - One model to rule them all ? Towards End-to-End Joint Speaker
Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary-length inputs and handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
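A toy sketch of sliding-window processing consistent with the summary; the window and hop sizes are assumptions, and the per-window joint model is elided:

```python
# Chunk an arbitrary-length recording into overlapping windows, so a local
# joint diarization+ASR model can run per window and results can be stitched.
def sliding_windows(num_samples: int, win: int = 16000 * 5, hop: int = 16000 * 2):
    start = 0
    while start < num_samples:
        yield start, min(start + win, num_samples)
        start += hop

# 12 s of 16 kHz audio -> overlapping 5 s windows with a 2 s hop.
print(list(sliding_windows(16000 * 12)))
```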
arXiv Detail & Related papers (2023-10-02T23:03:30Z) - Rethinking Speech Recognition with A Multimodal Perspective via Acoustic
and Semantic Cooperative Decoding [29.80299587861207]
We propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR.
Unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively.
We show that ASCD significantly improves the performance by leveraging both the acoustic and semantic information cooperatively.
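Below is a minimal PyTorch sketch of this cooperative-decoding idea; the layer wiring, dimensions, and naming are assumptions for illustration rather than the ASCD paper's exact design:

```python
# Each decoder step cross-attends to acoustic and semantic features jointly,
# instead of consuming them in two separate stages.
import torch
import torch.nn as nn

class CooperativeDecoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.acoustic_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.semantic_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.ReLU(),
                                nn.Linear(d_model * 4, d_model))

    def forward(self, tgt, acoustic, semantic):
        x = tgt + self.self_attn(tgt, tgt, tgt)[0]
        # Both information sources are consulted at every decoding step.
        x = x + self.acoustic_attn(x, acoustic, acoustic)[0]
        x = x + self.semantic_attn(x, semantic, semantic)[0]
        return x + self.ff(x)

layer = CooperativeDecoderLayer()
out = layer(torch.randn(2, 5, 256), torch.randn(2, 50, 256), torch.randn(2, 5, 256))
print(out.shape)  # torch.Size([2, 5, 256])
```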
arXiv Detail & Related papers (2023-05-23T13:25:44Z) - Hybrid Transducer and Attention based Encoder-Decoder Modeling for
Speech-to-Text Tasks [28.440232737011453]
We propose a solution that combines the Transducer and the Attention-based Encoder-Decoder (TAED) for speech-to-text tasks.
The new method leverages the AED's strength in non-monotonic sequence-to-sequence learning while retaining the Transducer's streaming property.
We evaluate the proposed approach on the MuST-C dataset and the findings demonstrate that TAED performs significantly better than the Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks.
arXiv Detail & Related papers (2023-05-04T18:34:50Z) - Rethinking Audio-visual Synchronization for Active Speaker Detection [62.95962896690992]
Existing research on active speaker detection (ASD) does not agree on the definition of active speakers.
We propose a cross-modal contrastive learning strategy and apply positional encoding in attention modules for supervised ASD models to leverage the synchronization cue.
Experimental results suggest that our model can successfully detect unsynchronized speaking as not speaking, addressing the limitation of current models.
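A generic cross-modal contrastive (InfoNCE-style) loss sketch; treating time-aligned audio-visual pairs as positives is an assumption consistent with the summary, not the paper's exact formulation:

```python
# Contrastive loss pulling synchronized audio/visual embeddings together
# and pushing unsynchronized pairs apart.
import torch
import torch.nn.functional as F

def cross_modal_nce(audio_emb, visual_emb, temperature=0.07):
    # audio_emb, visual_emb: (B, D); row i of each modality is synchronized.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(a.size(0))   # matched pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

print(cross_modal_nce(torch.randn(8, 128), torch.randn(8, 128)).item())
```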
arXiv Detail & Related papers (2022-06-21T14:19:06Z) - Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
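To illustrate the t-SOT idea referenced above, here is a toy serializer (the token naming and tuple format are assumptions): tokens from overlapping talkers are merged chronologically and a special token marks each change of virtual output channel:

```python
# Serialize multi-talker tokens into one stream, t-SOT style.
def t_sot_serialize(token_events):
    # token_events: list of (time, channel, token) tuples.
    events = sorted(token_events)                 # chronological order
    out, prev_ch = [], None
    for _, ch, tok in events:
        if prev_ch is not None and ch != prev_ch:
            out.append("<cc>")                    # virtual channel change
        out.append(tok)
        prev_ch = ch
    return out

stream = [(0.0, 0, "hello"), (0.4, 1, "hi"), (0.6, 0, "there"), (0.9, 1, "all")]
print(t_sot_serialize(stream))
# ['hello', '<cc>', 'hi', '<cc>', 'there', '<cc>', 'all']
```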
arXiv Detail & Related papers (2022-03-30T21:42:00Z) - Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with
Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
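A sketch of the non-autoregressive step summarised above: a greedy CTC collapse produces token IDs in one parallel pass, which would then feed the subsequent ASR decoder (implementation details assumed):

```python
# Greedy CTC decoding: one argmax per frame, then collapse repeats and blanks.
import torch

def ctc_greedy_collapse(log_probs: torch.Tensor, blank: int = 0) -> list:
    # log_probs: (T, vocab_size); frames are processed independently.
    ids = log_probs.argmax(dim=-1).tolist()
    out, prev = [], blank
    for i in ids:
        if i != blank and i != prev:  # drop blanks and repeated frames
            out.append(i)
        prev = i
    return out

log_probs = torch.randn(20, 30).log_softmax(dim=-1)
print(ctc_greedy_collapse(log_probs))
```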
arXiv Detail & Related papers (2021-09-27T05:21:30Z) - Speech enhancement aided end-to-end multi-task learning for voice
activity detection [40.44466027163059]
Speech enhancement is helpful to voice activity detection (VAD), but the performance improvement is limited.
We propose a speech enhancement aided end-to-end multi-task model for VAD.
The proposed mSI-SDR criterion uses VAD information to mask the output of the speech enhancement decoder during training.
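A small sketch of a VAD-masked SI-SDR objective in the spirit of mSI-SDR; the SI-SDR definition below is the standard one, while the sample-level masking rule is an assumption:

```python
# VAD labels gate the enhanced waveform before the standard SI-SDR is computed.
import torch

def si_sdr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference (scale-invariant target).
    target = (est * ref).sum() / (ref.pow(2).sum() + eps) * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

def masked_si_sdr(est, ref, vad_mask):
    # vad_mask: 1 where speech is active, 0 elsewhere (per sample).
    return si_sdr(est * vad_mask, ref * vad_mask)

est, ref = torch.randn(16000), torch.randn(16000)
vad = (torch.arange(16000) < 8000).float()  # toy VAD: first half is speech
print(masked_si_sdr(est, ref, vad).item())
```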
arXiv Detail & Related papers (2020-10-23T15:35:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.