Audio-visual speech separation based on joint feature representation
with cross-modal attention
- URL: http://arxiv.org/abs/2203.02655v1
- Date: Sat, 5 Mar 2022 04:39:46 GMT
- Title: Audio-visual speech separation based on joint feature representation
with cross-modal attention
- Authors: Junwen Xiong, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha, Yanning Zhang
- Abstract summary: This study is inspired by learning joint feature representations from audio and visual streams with an attention mechanism.
To further improve audio-visual speech separation, the dense optical flow of lip motion is incorporated.
The overall performance improvement demonstrates that the additional motion network effectively enhances the visual representation of the combined lip images and audio signal.
- Score: 45.210105822471256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal speech separation has exhibited a clear advantage in
isolating the target speaker in multi-talker noisy environments.
Unfortunately, most current separation strategies prefer a straightforward
fusion based on feature learning of each single modality, which falls far
short of fully considering the inter-relationships between modalities.
Inspired by learning joint feature representations from audio and visual
streams with an attention mechanism, this study proposes a novel cross-modal
fusion strategy that benefits the whole framework with semantic correlations
between different modalities. To further improve audio-visual speech
separation, the dense optical flow of lip motion is incorporated to strengthen
the robustness of the visual representation. The proposed work is evaluated on
two public audio-visual speech separation benchmark datasets. The overall
performance improvement demonstrates that the additional motion network
effectively enhances the visual representation of the combined lip images and
audio signal, and that the proposed cross-modal fusion outperforms the
baseline on all metrics.
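The abstract describes the approach only at a high level (cross-modal attention
between audio and visual streams, with dense optical flow of lip motion
strengthening the visual branch), so the following is a minimal sketch of that
kind of attention-based fusion rather than the authors' implementation. It
assumes PyTorch; the module names (CrossModalAttention, AudioVisualFusion), the
feature dimensions, and the nearest-neighbour alignment of visual frames to
audio frames are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of cross-modal attention fusion for
# audio-visual speech separation. All names and dimensions are assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Attend from one modality (query) to another (key/value context)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, context):
        # query:   (batch, T_q, dim), e.g. audio frames
        # context: (batch, T_k, dim), e.g. encoded lip/optical-flow frames
        attended, _ = self.attn(query, context, context)
        return self.norm(query + attended)  # residual fusion


class AudioVisualFusion(nn.Module):
    """Joint representation: audio attends to vision and vice versa."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Lip images and dense optical flow are assumed to be combined by a
        # visual encoder upstream; here only already-encoded streams are fused.
        self.a2v = CrossModalAttention(dim)  # audio queries, visual context
        self.v2a = CrossModalAttention(dim)  # visual queries, audio context
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat: (batch, T_a, dim), visual_feat: (batch, T_v, dim)
        audio_ctx = self.a2v(audio_feat, visual_feat)
        visual_ctx = self.v2a(visual_feat, audio_feat)
        # Align the visual stream to the audio frame rate (nearest-neighbour).
        visual_ctx = nn.functional.interpolate(
            visual_ctx.transpose(1, 2), size=audio_ctx.shape[1]
        ).transpose(1, 2)
        joint = torch.cat([audio_ctx, visual_ctx], dim=-1)
        return self.proj(joint)  # (batch, T_a, dim)


if __name__ == "__main__":
    fusion = AudioVisualFusion(dim=256)
    audio = torch.randn(2, 100, 256)    # e.g. 100 spectrogram frames
    visual = torch.randn(2, 25, 256)    # e.g. 25 lip/optical-flow frames
    print(fusion(audio, visual).shape)  # torch.Size([2, 100, 256])
```

In a separation system of this kind, the fused representation would typically
feed a mask-estimation network that recovers the target speaker's spectrogram
from the mixture; that stage is omitted here.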
Related papers
- Multi-modal Speech Emotion Recognition via Feature Distribution Adaptation Network [12.200776612016698]
We propose a novel deep inductive transfer learning framework, named feature distribution adaptation network.
Our method aims to use deep transfer learning strategies to align visual and audio feature distributions to obtain consistent representation of emotion.
arXiv Detail & Related papers (2024-10-29T13:13:30Z)
- Audio-Visual Person Verification based on Recursive Fusion of Joint Cross-Attention [3.5803801804085347]
We introduce a joint cross-attentional model, where a joint audio-visual feature representation is employed in the cross-attention framework.
We also explore BLSTMs to improve the temporal modeling of audio-visual feature representations.
Results indicate that the proposed model shows promising improvement in fusion performance by adeptly capturing the intra- and inter-modal relationships.
arXiv Detail & Related papers (2024-03-07T16:57:45Z)
- CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method (CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z)
- Audio-Visual Speaker Verification via Joint Cross-Attention [4.229744884478575]
We investigate cross-modal joint attention to fully leverage the inter-modal complementary information and the intra-modal information for speaker verification; a minimal sketch of this joint cross-attention pattern is given after this list.
We have shown that efficiently leveraging the intra- and inter-modal relationships significantly improves the performance of audio-visual fusion for speaker verification.
arXiv Detail & Related papers (2023-09-28T16:25:29Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement [94.0676772764248]
We propose a visual embedding approach to improving embedding aware speech enhancement (EASE).
We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE).
Next, we extract audio-visual embedding from noisy speech and lip videos in an information intersection manner, utilizing the complementarity of audio and visual features for multi-modal EASE (MEASE).
arXiv Detail & Related papers (2020-09-21T01:26:19Z)
- Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention [25.883429290596556]
The major challenge in audio-visual event localization task lies in how to fuse information from multiple modalities effectively.
Recent works have shown that attention mechanism is beneficial to the fusion process.
We propose a novel joint attention mechanism with multimodal fusion methods for audio-visual event localization.
arXiv Detail & Related papers (2020-08-14T21:50:26Z)
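Several of the related papers above, notably the joint cross-attention works on
speaker and person verification, share one core idea: each modality attends to
a joint audio-visual representation rather than directly to the other modality.
The snippet below is only an illustrative sketch of that pattern under assumed
PyTorch modules and dimensions; the class name JointCrossAttention and the
mean-pooling readout are assumptions, not any paper's released code.

```python
# Illustrative sketch of joint cross-attention: both streams attend to a
# shared joint representation. Names and dimensions are assumptions.
import torch
import torch.nn as nn


class JointCrossAttention(nn.Module):
    """Each modality queries a joint audio-visual representation."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.joint_proj = nn.Linear(2 * dim, dim)
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, visual):
        # audio, visual: (batch, T, dim), assumed already time-aligned
        joint = self.joint_proj(torch.cat([audio, visual], dim=-1))
        audio_att, _ = self.audio_attn(audio, joint, joint)     # audio -> joint
        visual_att, _ = self.visual_attn(visual, joint, joint)  # visual -> joint
        # Pool over time into a single fused embedding, e.g. for verification.
        return torch.cat([audio_att, visual_att], dim=-1).mean(dim=1)


if __name__ == "__main__":
    jca = JointCrossAttention(dim=128)
    a = torch.randn(2, 50, 128)   # 50 audio frames
    v = torch.randn(2, 50, 128)   # 50 visual frames
    print(jca(a, v).shape)        # torch.Size([2, 256])
```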