Tackling real noisy reverberant meetings with all-neural source
separation, counting, and diarization system
- URL: http://arxiv.org/abs/2003.03987v1
- Date: Mon, 9 Mar 2020 09:25:38 GMT
- Title: Tackling real noisy reverberant meetings with all-neural source
separation, counting, and diarization system
- Authors: Keisuke Kinoshita, Marc Delcroix, Shoko Araki, Tomohiro Nakatani
- Abstract summary: We propose an all-neural approach that jointly solves source separation, speaker diarization and source counting problems.
We experimentally show that the all-neural approach can perform effective speech enhancement, and simultaneously outperform state-of-the-art systems.
- Score: 105.09252216321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic meeting analysis is an essential fundamental technology required to
let, for example, smart devices follow and respond to our conversations. To achieve an
optimal automatic meeting analysis, we previously proposed an all-neural
approach that jointly solves source separation, speaker diarization and source
counting problems in an optimal way (in the sense that all three tasks can be
jointly optimized through error back-propagation). It was shown that the method
could well handle simulated clean (noiseless and anechoic) dialog-like data,
and achieved very good performance in comparison with several conventional
methods. However, it was not clear whether such an all-neural approach would
generalize successfully to more complicated real meeting data containing more
spontaneously speaking speakers, severe noise and reverberation, or how it
would perform in comparison with state-of-the-art systems in such scenarios. In
this paper, we first consider practical issues required for improving the
robustness of the all-neural approach, and then experimentally show that, even
in real meeting scenarios, the all-neural approach can perform effective speech
enhancement, and simultaneously outperform state-of-the-art systems.
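To make the joint formulation above concrete, the sketch below shows one way such a model could be organized: a recurrent network that, in each pass over the input spectrogram, emits a time-frequency mask for one speaker (separation), frame-level speaker activity (diarization), and a stop flag whose firing point yields the source count. This is a minimal illustration under assumed layer sizes and names (JointSepDiarCount, n_freq, hidden, and extract_all_speakers are all hypothetical), not the authors' actual architecture or training losses.

```python
import torch
import torch.nn as nn

class JointSepDiarCount(nn.Module):
    def __init__(self, n_freq=257, hidden=300):
        super().__init__()
        # Per frame, the network sees the mixture spectrum plus a "residual"
        # mask indicating which time-frequency bins are still unexplained.
        self.blstm = nn.LSTM(input_size=2 * n_freq, hidden_size=hidden,
                             num_layers=2, batch_first=True, bidirectional=True)
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())
        self.activity_head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())
        self.stop_head = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, spec, residual):
        # spec, residual: (batch, time, n_freq)
        h, _ = self.blstm(torch.cat([spec, residual], dim=-1))
        mask = self.mask_head(h)                          # separation: T-F mask for one speaker
        activity = self.activity_head(h).squeeze(-1)      # diarization: frame-level activity
        stop = self.stop_head(h.mean(dim=1)).squeeze(-1)  # counting: utterance-level stop flag
        return mask, activity, stop


def extract_all_speakers(model, spec, max_speakers=6, stop_threshold=0.5):
    """Peel off one speaker per pass; the number of passes taken before the
    stop flag fires is the estimated speaker count (batch size 1 assumed)."""
    residual = torch.ones_like(spec)
    outputs = []
    for _ in range(max_speakers):
        mask, activity, stop = model(spec, residual)
        outputs.append((mask, activity))
        residual = torch.clamp(residual - mask, min=0.0)  # remove the extracted speaker
        if stop.item() > stop_threshold:
            break
    return outputs
```

Because all three heads share the same recurrent encoder, the separation, diarization, and counting objectives can in principle be summed into a single loss and trained end-to-end with back-propagation, which is the property the abstract emphasizes.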
Related papers
- From Modular to End-to-End Speaker Diarization [3.079020586262228]
We describe a system based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), known as VBx.
We describe an approach for generating synthetic data which resembles real conversations in terms of speaker turns and overlaps.
We show how this method of generating "simulated conversations" allows for better performance than a previously proposed method for creating "simulated mixtures" when training the popular EEND.
arXiv Detail & Related papers (2024-06-27T15:09:39Z)
- A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model [14.795953417531907]
We propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Attend and Spell (LAS) speech recognition system.
The proposed method achieves a 19.26% improvement over a strong baseline.
arXiv Detail & Related papers (2024-01-05T07:11:13Z)
- On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training [33.79711018198589]
We extend the existing mixture invariant training criterion to exploit both unpaired clean speech and real noisy data.
It is found that the unpaired clean speech is crucial to improving the quality of speech separated from real noisy recordings.
The proposed method also performs remixing of processed and unprocessed signals to alleviate the processing artifacts.
arXiv Detail & Related papers (2022-05-03T19:37:58Z)
- A combined approach to the analysis of speech conversations in a contact center domain [2.575030923243061]
We describe an experiment with a speech analytics process for an Italian contact center that deals with call recordings extracted from inbound or outbound flows.
First, we illustrate in detail the development of an in-house speech-to-text solution based on the Kaldi framework.
Then, we evaluate and compare different approaches to the semantic tagging of call transcripts.
Finally, a decision tree inducer, called J48S, is applied to the problem of tagging.
arXiv Detail & Related papers (2022-03-12T10:03:20Z)
- Towards Robust Online Dialogue Response Generation [62.99904593650087]
We argue that this can be caused by a discrepancy between training and real-world testing.
We propose a hierarchical sampling-based method consisting of both utterance-level sampling and semi-utterance-level sampling.
arXiv Detail & Related papers (2022-03-07T06:51:41Z)
- Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After predicting the discrete symbol sequence, each target speech signal can be re-synthesized by feeding the predicted symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Multimodal Attention Fusion for Target Speaker Extraction [108.73502348754842]
We propose a novel attention mechanism for multi-modal fusion and its training methods.
Our proposals improve the signal-to-distortion ratio (SDR) by 1.0 dB over conventional fusion mechanisms on simulated data.
arXiv Detail & Related papers (2021-02-02T05:59:35Z)
- Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds [71.36164750147827]
Clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors.
End-to-end neural diarization (EEND) directly predicts diarization labels using a neural network.
We propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers.
arXiv Detail & Related papers (2020-10-26T06:33:02Z)
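The last entry above combines clustering-based and end-to-end neural diarization. A rough sketch of how such a hybrid could be wired up is given below: an EEND-style model is run on short chunks to obtain overlap-aware local speaker activities plus one embedding per local speaker, and those embeddings are then clustered across chunks to link chunk-local speakers into global identities. The front-end interface (eend_chunk_model), the clustering settings, and the return format are illustrative assumptions, not the cited paper's actual algorithm.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize_long_recording(chunks, eend_chunk_model, distance_threshold=1.0):
    """chunks: list of per-chunk features; eend_chunk_model: placeholder callable
    returning (activities, embeddings) for each chunk (hypothetical interface)."""
    local_activities = []        # per chunk: array (num_local_speakers, num_frames)
    embeddings, owners = [], []  # one embedding per (chunk, local speaker)

    for c, chunk in enumerate(chunks):
        # Overlap-aware neural front-end: activities for every speaker found in
        # this chunk, plus one embedding per such speaker.
        activities, chunk_embs = eend_chunk_model(chunk)
        local_activities.append(activities)
        for s, emb in enumerate(chunk_embs):
            embeddings.append(emb)
            owners.append((c, s))

    # Cluster chunk-local speaker embeddings across the whole recording; the
    # number of clusters found is the estimated global speaker count.
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    ).fit_predict(np.stack(embeddings))

    # Relabel each chunk-local speaker with its global cluster identity.
    global_segments = [(c, int(g), local_activities[c][s])
                       for (c, s), g in zip(owners, labels)]
    return global_segments, int(labels.max()) + 1
```

Handling overlap locally with a neural model while resolving long-range speaker identity with clustering is what lets this style of system cope with recordings of arbitrary length and speaker count.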