Correlating Subword Articulation with Lip Shapes for Embedding Aware
Audio-Visual Speech Enhancement
- URL: http://arxiv.org/abs/2009.09561v1
- Date: Mon, 21 Sep 2020 01:26:19 GMT
- Title: Correlating Subword Articulation with Lip Shapes for Embedding Aware
Audio-Visual Speech Enhancement
- Authors: Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee
- Abstract summary: We propose a visual embedding approach to improving embedding
aware speech enhancement (EASE).
We first extract a visual embedding from lip frames using a pre-trained phone or
articulation-place recognizer for visual-only EASE (VEASE).
Next, we extract an audio-visual embedding from noisy speech and lip videos in an
information-intersection manner, utilizing the complementarity of audio and
visual features for multi-modal EASE (MEASE).
- Score: 94.0676772764248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a visual embedding approach to improving embedding
aware speech enhancement (EASE) by synchronizing visual lip frames at the phone
and place of articulation levels. We first extract a visual embedding from lip
frames using a pre-trained phone or articulation-place recognizer for
visual-only EASE (VEASE). Next, we extract an audio-visual embedding from noisy
speech and lip videos in an information-intersection manner, utilizing the
complementarity of audio and visual features for multi-modal EASE (MEASE).
Experiments on the TCD-TIMIT corpus corrupted by simulated additive noises show
that our proposed subword-based VEASE approach is more effective than
conventional embedding at the word level. Moreover, visual embedding at the
articulation-place level, leveraging the high correlation between place of
articulation and lip shapes, performs even better than embedding at the phone
level. Finally, the proposed MEASE framework, incorporating both audio and
visual embedding, yields significantly better speech quality and
intelligibility than those obtained with the best visual-only and audio-only
EASE systems.
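To make the embedding-aware setup concrete, below is a minimal PyTorch sketch,
not the authors' released code: the module names, layer sizes, and the simple
concatenation fusion are illustrative assumptions standing in for the paper's
pre-trained recognizer embeddings and information-intersection fusion. It shows
how a visual embedding from lip frames and an acoustic embedding from the noisy
spectrogram can jointly condition a mask-based enhancer in the spirit of MEASE.

```python
# Minimal sketch (not the authors' code): an embedding-aware enhancement
# network in the spirit of MEASE. A stand-in for the pre-trained lip
# recognizer yields a visual embedding; a GRU encoder yields an acoustic
# embedding; their fused embedding conditions a mask estimator applied to
# the noisy magnitude spectrogram.
import torch
import torch.nn as nn

class VisualEmbedder(nn.Module):
    """Hypothetical stand-in for a pre-trained phone / articulation-place recognizer."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space
        )
        self.proj = nn.Linear(32, emb_dim)

    def forward(self, lips):                              # lips: (B, 1, T, H, W)
        feats = self.net(lips).squeeze(-1).squeeze(-1)    # (B, 32, T)
        return self.proj(feats.transpose(1, 2))           # (B, T, emb_dim)

class MEASENet(nn.Module):
    """Hypothetical embedding-aware mask estimator conditioned on an AV embedding."""
    def __init__(self, n_freq=257, emb_dim=128, hidden=256):
        super().__init__()
        self.audio_enc = nn.GRU(n_freq, emb_dim, batch_first=True)
        self.fusion = nn.Linear(2 * emb_dim, emb_dim)      # toy concat fusion
        self.masker = nn.GRU(n_freq + emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag, visual_emb):              # (B, T, F), (B, T, E)
        audio_emb, _ = self.audio_enc(noisy_mag)           # (B, T, E)
        av_emb = torch.relu(self.fusion(torch.cat([audio_emb, visual_emb], dim=-1)))
        h, _ = self.masker(torch.cat([noisy_mag, av_emb], dim=-1))
        mask = torch.sigmoid(self.out(h))                  # ratio mask in [0, 1]
        return mask * noisy_mag                            # enhanced magnitude

# Toy forward pass: ~1 s of audio (63 STFT frames, 257 bins) with lip frames
# assumed to be already upsampled/aligned to the spectrogram frame rate.
noisy_mag = torch.rand(2, 63, 257)
lips = torch.rand(2, 1, 63, 96, 96)
enhanced = MEASENet()(noisy_mag, VisualEmbedder()(lips))
print(enhanced.shape)  # torch.Size([2, 63, 257])
```

Note that the paper's actual fusion is learned in an information-intersection
manner from phone- or articulation-place-level supervision; the plain
concatenation above is only a placeholder for where that embedding would enter
the enhancer.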
Related papers
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with
Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z) - Audio-Visual Speaker Verification via Joint Cross-Attention [4.229744884478575]
We introduce cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification (a generic cross-attention sketch appears after this list).
We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification.
arXiv Detail & Related papers (2023-09-28T16:25:29Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - LiRA: Learning Visual Speech Representations from Audio through
Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
arXiv Detail & Related papers (2021-06-16T23:20:06Z) - On the Role of Visual Cues in Audiovisual Speech Enhancement [21.108094726214784]
We show how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal.
One byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications.
arXiv Detail & Related papers (2020-04-25T01:00:03Z) - How to Teach DNNs to Pay Attention to the Visual Modality in Speech
Recognition [10.74796391075403]
This study investigates the inner workings of AV Align and visualises the audio-visual alignment patterns.
We find that AV Align learns to align acoustic and visual representations of speech at the frame level on TCD-TIMIT in a generally monotonic pattern.
We propose a regularisation method which involves predicting lip-related Action Units from visual representations.
arXiv Detail & Related papers (2020-04-17T13:59:19Z)