Towards Intelligibility-Oriented Audio-Visual Speech Enhancement
- URL: http://arxiv.org/abs/2111.09642v1
- Date: Thu, 18 Nov 2021 11:47:37 GMT
- Title: Towards Intelligibility-Oriented Audio-Visual Speech Enhancement
- Authors: Tassadaq Hussain, Mandar Gogate, Kia Dashtipour, Amir Hussain
- Abstract summary: We present a fully convolutional AV SE model that uses a modified short-time objective intelligibility (STOI) metric as a training cost function.
Our proposed I-O AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
- Score: 8.19144665585397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing deep learning (DL) based speech enhancement approaches are generally
optimised to minimise the distance between clean and enhanced speech features.
These often result in improved speech quality; however, they suffer from a lack
of generalisation and may not deliver the required speech intelligibility in
real noisy situations. In an attempt to address these challenges, researchers
have explored intelligibility-oriented (I-O) loss functions and integration of
audio-visual (AV) information for more robust speech enhancement (SE). In this
paper, we introduce DL based I-O SE algorithms exploiting AV information, which
is a novel and previously unexplored research direction. Specifically, we
present a fully convolutional AV SE model that uses a modified short-time
objective intelligibility (STOI) metric as a training cost function. To the
best of our knowledge, this is the first work that exploits the integration of
AV modalities with an I-O based loss function for SE. Comparative experimental
results demonstrate that our proposed I-O AV SE framework outperforms
audio-only (AO) and AV models trained with conventional distance-based loss
functions, in terms of standard objective evaluation measures when dealing with
unseen speakers and noises.
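The key technical idea in the abstract, replacing a conventional distance-based loss with a modified STOI metric as the training cost, can be pictured with a short sketch. The following PyTorch snippet is a minimal approximation under our own assumptions, not the authors' modified STOI: the function name stoi_like_loss and the hyperparameters (n_fft, hop, seg_frames) are illustrative, and a faithful STOI implementation would additionally resample to 10 kHz and use one-third octave bands, 384 ms segments and clipping.
```python
# Minimal sketch (not the authors' implementation): a differentiable,
# STOI-inspired training loss for a speech enhancement network. The idea
# is approximated with per-segment correlations between short-time
# magnitude spectra, so it can be minimised directly during training.
import torch


def stoi_like_loss(enhanced, clean, n_fft=512, hop=128, seg_frames=30, eps=1e-8):
    """Negative mean segment-wise correlation between spectral envelopes.

    enhanced, clean: (batch, samples) time-domain waveforms.
    """
    window = torch.hann_window(n_fft, device=enhanced.device)
    # (batch, freq_bins, frames) magnitude spectrograms.
    E = torch.stft(enhanced, n_fft, hop, window=window, return_complex=True).abs()
    C = torch.stft(clean, n_fft, hop, window=window, return_complex=True).abs()

    # Split the frame axis into fixed-length analysis segments.
    n_seg = E.shape[-1] // seg_frames
    E = E[..., : n_seg * seg_frames].reshape(E.shape[0], E.shape[1], n_seg, seg_frames)
    C = C[..., : n_seg * seg_frames].reshape(C.shape[0], C.shape[1], n_seg, seg_frames)

    # Zero-mean each segment, then correlate along its time axis per frequency bin.
    E = E - E.mean(dim=-1, keepdim=True)
    C = C - C.mean(dim=-1, keepdim=True)
    corr = (E * C).sum(dim=-1) / (E.norm(dim=-1) * C.norm(dim=-1) + eps)

    # Higher correlation ~ higher predicted intelligibility, so minimise the negative.
    return -corr.mean()


# Usage with any enhancement model, e.g. a fully convolutional AV SE network:
#   loss = stoi_like_loss(model(noisy_audio, lip_frames), clean_audio)
#   loss.backward()
```
Because every step is differentiable, the same network that would otherwise be trained with a distance-based (e.g. MSE) loss can instead be optimised directly towards this intelligibility proxy.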
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-audio detection framework and benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning [15.673602262069531]
Active Speaker Detection (ASD) is a task to determine whether a person is speaking or not in a series of video frames.
We propose TalkNCE, a novel talk-aware contrastive loss.
Our method achieves state-of-the-art performance on the AVA-ActiveSpeaker and ASW datasets.
arXiv Detail & Related papers (2023-09-21T17:59:11Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- A Novel Speech Intelligibility Enhancement Model based on Canonical Correlation and Deep Learning [12.913738983870621]
We present a canonical correlation based short-time objective intelligibility (CC-STOI) cost function to train a fully convolutional neural network (FCN) model.
We show that our CC-STOI based speech enhancement framework outperforms state-of-the-art DL models trained with conventional distance-based and STOI-based loss functions.
arXiv Detail & Related papers (2022-02-11T16:48:41Z)
- Towards Robust Real-time Audio-Visual Speech Enhancement [8.183895606832623]
We present a novel framework for low-latency speaker-independent AV SE.
In particular, a generative adversarial network (GAN) is proposed to address the practical issue of visual imperfections in AV SE.
We propose a deep neural network based real-time AV SE model that takes into account the cleaned visual speech output from the GAN to deliver more robust SE.
arXiv Detail & Related papers (2021-12-16T17:54:45Z)
- Improved Lite Audio-Visual Speech Enhancement [27.53117725152492]
We propose a lite audio-visual speech enhancement (LAVSE) algorithm for a car-driving scenario.
In this study, we extend LAVSE into an improved version, iLAVSE, to address three practical issues often encountered when implementing AVSE systems.
We evaluate iLAVSE on the Taiwan Mandarin Speech with Video dataset.
arXiv Detail & Related papers (2020-08-30T17:29:19Z)
- Characterizing Speech Adversarial Examples Using Self-Attention U-Net Enhancement [102.48582597586233]
We present a U-Net based attention model, U-Net$_At$, to enhance adversarial speech signals.
We conduct experiments on the automatic speech recognition (ASR) task with adversarial audio attacks.
arXiv Detail & Related papers (2020-03-31T02:16:34Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)