Temporarily-Aware Context Modelling using Generative Adversarial
Networks for Speech Activity Detection
- URL: http://arxiv.org/abs/2004.01546v1
- Date: Thu, 2 Apr 2020 02:33:13 GMT
- Title: Temporarily-Aware Context Modelling using Generative Adversarial
Networks for Speech Activity Detection
- Authors: Tharindu Fernando, Sridha Sridharan, Mitchell McLaren, Darshana
Priyasad, Simon Denman, Clinton Fookes
- Abstract summary: We propose a novel joint learning framework for Speech Activity Detection (SAD).
We utilise generative adversarial networks to automatically learn a loss function for joint prediction of the frame-wise speech/non-speech classifications together with the next audio segment.
We evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT'17, AMI Meeting and HAVIC.
- Score: 43.662221486962274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel framework for Speech Activity Detection (SAD).
Inspired by the recent success of multi-task learning approaches in the speech
processing domain, we propose a novel joint learning framework for SAD. We
utilise generative adversarial networks to automatically learn a loss function
for joint prediction of the frame-wise speech/non-speech classifications
together with the next audio segment. In order to exploit the temporal
relationships within the input signal, we propose a temporal discriminator
which aims to ensure that the predicted signal is temporally consistent. We
evaluate the proposed framework on multiple public benchmarks, including NIST
OpenSAT'17, AMI Meeting and HAVIC, where we demonstrate its capability to
outperform state-of-the-art SAD approaches. Furthermore, our cross-database
evaluations demonstrate the robustness of the proposed approach across
different languages, accents, and acoustic environments.
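The abstract describes the framework only at a high level: a generator jointly predicts frame-wise speech/non-speech labels and the next audio segment, while a temporal discriminator supplies a learned loss that rewards temporally consistent predictions. The PyTorch sketch below is a minimal illustration of that general setup; the GRU backbone, layer sizes, module names, and the BCE-based adversarial objective are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Encodes an audio feature sequence and jointly predicts frame-wise
    speech/non-speech posteriors and the next audio segment.
    (Hypothetical layer sizes; the paper does not specify this architecture.)"""
    def __init__(self, feat_dim=40, hidden_dim=128, segment_dim=40):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        self.sad_head = nn.Linear(hidden_dim, 1)                # frame-wise speech prob.
        self.segment_head = nn.Linear(hidden_dim, segment_dim)  # next-segment prediction

    def forward(self, x):                      # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)                 # (batch, frames, hidden_dim)
        speech_prob = torch.sigmoid(self.sad_head(h)).squeeze(-1)  # (batch, frames)
        next_segment = self.segment_head(h[:, -1])                 # (batch, segment_dim)
        return speech_prob, next_segment

class TemporalDiscriminator(nn.Module):
    """Scores whether a window of consecutive segments forms a temporally
    consistent continuation (real) or a generated one (fake)."""
    def __init__(self, segment_dim=40, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(segment_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, segments):               # segments: (batch, window, segment_dim)
        h, _ = self.rnn(segments)
        return torch.sigmoid(self.out(h[:, -1])).squeeze(-1)       # (batch,)

def training_step(gen, disc, bce, x, labels, history, next_real, g_opt, d_opt):
    """One adversarial step (sketch): the discriminator provides the learned
    loss on segment continuations; plain BCE supervises the frame-wise labels."""
    speech_prob, next_fake = gen(x)
    batch = x.size(0)

    # Discriminator: real continuation vs. generated continuation.
    real_seq = torch.cat([history, next_real.unsqueeze(1)], dim=1)
    fake_seq = torch.cat([history, next_fake.detach().unsqueeze(1)], dim=1)
    d_loss = bce(disc(real_seq), torch.ones(batch)) + \
             bce(disc(fake_seq), torch.zeros(batch))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator: fool the temporal discriminator and match the SAD labels.
    fake_seq = torch.cat([history, next_fake.unsqueeze(1)], dim=1)
    g_loss = bce(disc(fake_seq), torch.ones(batch)) + bce(speech_prob, labels)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

The point this sketch tries to capture is that the discriminator scores whole windows of consecutive segments, so the learned adversarial loss penalises predictions that look plausible frame by frame but are temporally inconsistent, while the BCE term keeps the frame-wise speech/non-speech output supervised.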
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling [12.153618111267514]
We put forward disparate conversation history fusion methods for language modeling in automatic speech recognition.
A novel audio-fusion mechanism is introduced, which manages to fuse and utilize the acoustic embeddings of a current utterance and the semantic content of its corresponding conversation history.
To flesh out our ideas, we frame the ASR N-best hypothesis rescoring task as a prediction problem, leveraging BERT, an iconic pre-trained LM.
arXiv Detail & Related papers (2021-11-05T09:07:23Z)
- With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z)
- Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN).
TSEGAN is an extension of the generative adversarial network (GAN) in the time domain, with metric evaluation to mitigate the scaling problem.
In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z)
- Investigating Cross-Domain Losses for Speech Enhancement [7.641695369120866]
Recent years have seen a surge in the number of available frameworks for speech enhancement (SE) and recognition.
In this study, we investigate the advantages of each set of approaches by separately examining their impact on speech intelligibility and quality.
arXiv Detail & Related papers (2020-10-20T17:28:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.