Joint speech and overlap detection: a benchmark over multiple audio
setup and speech domains
- URL: http://arxiv.org/abs/2307.13012v1
- Date: Mon, 24 Jul 2023 14:29:21 GMT
- Authors: Martin Lebourdais (LIUM), Théo Mariotte (LIUM, LAUM), Marie Tahon
(LIUM), Anthony Larcher (LIUM), Antoine Laurent (LIUM), Silvio Montresor
(LAUM), Sylvain Meignier (LIUM), Jean-Hugh Thomas (LAUM)
- Abstract summary: VAD and OSD can be trained jointly using a multi-class classification model.
This paper proposes a complete new benchmark of different VAD and OSD models.
Our 2/3-class systems, which combine a Temporal Convolutional Network with speech representations adapted to the setup, outperform state-of-the-art results.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice activity detection and overlapped speech detection (VAD and OSD,
respectively) are key pre-processing tasks for speaker diarization. The final
segmentation performance relies heavily on the robustness of these sub-tasks.
Recent studies have shown that VAD and OSD can be trained jointly using a
multi-class classification model. However, these works are often restricted to a
specific speech domain and provide little information about the generalization
capacity of the systems. This paper proposes a complete new benchmark of different
VAD and OSD models, covering multiple audio setups (single/multi-channel) and
speech domains (e.g., media, meetings). Our 2/3-class systems, which combine a
Temporal Convolutional Network with speech representations adapted to the setup,
outperform state-of-the-art results. We show that jointly training these two tasks
offers F1-scores similar to those of two dedicated VAD and OSD systems while
reducing the training cost. This single architecture can also be used for both
single-channel and multichannel speech processing.
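The abstract gives no implementation details, but the 2/3-class formulation maps naturally onto a frame-level classifier. Below is a minimal sketch, assuming PyTorch, precomputed 80-dimensional frame features, and a small stack of dilated convolutions standing in for the Temporal Convolutional Network; the class mapping (0 = non-speech, 1 = single-speaker speech, 2 = overlapped speech) yields VAD as classes {1, 2} and OSD as class 2. All layer sizes, the feature frontend, and the dilation schedule are illustrative assumptions, not the authors' configuration.

    # Minimal sketch (not the authors' implementation) of a 3-class joint
    # VAD/OSD frame classifier: a small temporal convolutional network (TCN)
    # over precomputed frame-level speech features.
    import torch
    import torch.nn as nn


    class TCNBlock(nn.Module):
        """Residual block: dilated 1D convolution -> batch norm -> ReLU."""

        def __init__(self, channels: int, dilation: int, kernel_size: int = 3):
            super().__init__()
            pad = (kernel_size - 1) // 2 * dilation  # keep frame count unchanged
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=pad, dilation=dilation)
            self.norm = nn.BatchNorm1d(channels)
            self.act = nn.ReLU()

        def forward(self, x):  # x: (batch, channels, frames)
            return x + self.act(self.norm(self.conv(x)))


    class JointVadOsd(nn.Module):
        """Frame-level 3-class classifier: 0=non-speech, 1=speech, 2=overlap."""

        def __init__(self, feat_dim: int = 80, channels: int = 128,
                     n_blocks: int = 5, n_classes: int = 3):
            super().__init__()
            self.proj = nn.Conv1d(feat_dim, channels, kernel_size=1)
            self.tcn = nn.Sequential(
                *[TCNBlock(channels, dilation=2 ** i) for i in range(n_blocks)])
            self.head = nn.Conv1d(channels, n_classes, kernel_size=1)

        def forward(self, feats):  # feats: (batch, frames, feat_dim)
            x = self.proj(feats.transpose(1, 2))
            logits = self.head(self.tcn(x))     # (batch, n_classes, frames)
            return logits.transpose(1, 2)       # (batch, frames, n_classes)


    if __name__ == "__main__":
        model = JointVadOsd()
        feats = torch.randn(2, 400, 80)         # e.g. 4 s of 10 ms frames
        pred = model(feats).softmax(dim=-1).argmax(dim=-1)  # per-frame class
        vad = pred >= 1                         # speech or overlapped speech
        osd = pred == 2                         # overlapped speech only
        print(vad.shape, osd.shape)

Both detection outputs are read from a single softmax head, which is what makes the joint training cheaper than running two dedicated systems.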
Related papers
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA); a rough sketch of the underlying idea appears after this list.
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while remaining robust when microphones are deactivated.
arXiv Detail & Related papers (2023-06-07T09:09:00Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
The paper proposes Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences across multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- SpeechPrompt v2: Prompt Tuning for Speech Classification Tasks [94.30385972442387]
We propose SpeechPrompt v2, a prompt tuning framework capable of performing a wide variety of speech classification tasks.
Experimental results show that SpeechPrompt v2 achieves performance on par with prior work using fewer than 0.15M trainable parameters.
arXiv Detail & Related papers (2023-03-01T18:47:41Z)
- ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
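For the "Multi-microphone Automatic Speech Segmentation" entry above, the sketch below illustrates, very roughly, what direction-of-arrival estimation in the circular harmonic domain can look like for a uniform circular array. It is an assumption-laden illustration, not the CH-DOA feature extraction of the cited paper: the array geometry, STFT shapes, sign conventions, and the crude first-order azimuth estimator are all assumptions.

    # Rough illustration (assumptions, not the cited paper's CH-DOA features):
    # circular-harmonic (CH) coefficients of a uniform circular array and a
    # crude first-order azimuth estimate per time-frequency bin.
    import numpy as np


    def circular_harmonics(stft, orders=(0, 1)):
        """stft: (mics, frames, freqs) complex STFT of a uniform circular array.

        Returns {n: B_n} with B_n of shape (frames, freqs), where
        B_n(t, f) = (1/M) * sum_m X_m(t, f) * exp(+j * n * phi_m)
        and phi_m = 2*pi*m/M is the azimuth of microphone m.
        """
        n_mics = stft.shape[0]
        phi = 2.0 * np.pi * np.arange(n_mics) / n_mics
        return {n: np.einsum("mtf,m->tf", stft, np.exp(1j * n * phi)) / n_mics
                for n in orders}


    def crude_doa(stft):
        """Very rough per-bin azimuth estimate from the first-order CH phase.

        Under one common sign convention, a far-field plane wave from azimuth
        theta gives B_1 ~ j * J_1(kr) * S * exp(j*theta) and B_0 ~ J_0(kr) * S,
        so angle(B_1 * conj(B_0)) ~ theta + pi/2 while both Bessel terms are
        positive.  Bessel zeros, noise, and reverberation are ignored here.
        """
        B = circular_harmonics(stft, orders=(0, 1))
        theta = np.angle(B[1] * np.conj(B[0])) - np.pi / 2.0
        return np.mod(theta + np.pi, 2.0 * np.pi) - np.pi  # wrap to [-pi, pi)


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        x = rng.standard_normal((8, 50, 257)) + 1j * rng.standard_normal((8, 50, 257))
        print(crude_doa(x).shape)  # (50, 257): one azimuth per time-frequency bin

In a segmentation system, per-frame statistics of these per-bin angles (for example a histogram or a circular mean) could serve as spatial features alongside spectral ones; the cited paper should be consulted for the actual CH-DOA definition.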