TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down
Fusion
- URL: http://arxiv.org/abs/2401.14185v1
- Date: Thu, 25 Jan 2024 13:47:22 GMT
- Title: TDFNet: An Efficient Audio-Visual Speech Separation Model with Top-down
Fusion
- Authors: Samuel Pegg, Kai Li, Xiaolin Hu
- Abstract summary: Top-Down-Fusion Net (TDFNet) is a state-of-the-art (SOTA) model for audio-visual speech separation.
TDFNet achieves a performance increase of up to 10% across all performance metrics compared with the previous SOTA method CTCNet.
- Score: 21.278294846228935
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-visual speech separation has gained significant traction in recent
years due to its potential applications in various fields such as speech
recognition, diarization, scene analysis and assistive technologies. Designing
a lightweight audio-visual speech separation network is important for
low-latency applications, but existing methods often require higher
computational costs and more parameters to achieve better separation
performance. In this paper, we present an audio-visual speech separation model
called Top-Down-Fusion Net (TDFNet), a state-of-the-art (SOTA) model for
audio-visual speech separation, which builds upon the architecture of TDANet,
an audio-only speech separation method. TDANet serves as the architectural
foundation for the auditory and visual networks within TDFNet, offering an
efficient model with fewer parameters. On the LRS2-2Mix dataset, TDFNet
achieves a performance increase of up to 10% across all performance metrics
compared with the previous SOTA method CTCNet. Remarkably, these results are
achieved using fewer parameters and only 28% of the multiply-accumulate
operations (MACs) of CTCNet. In essence, our method presents a highly effective
and efficient solution to the challenges of speech separation within the
audio-visual domain, making significant strides in harnessing visual
information optimally.
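The abstract does not detail the fusion mechanism itself, but the general idea of top-down fusion (coarse, semantically rich features modulating finer-scale ones, with visual cues injected at the top) can be illustrated with a minimal PyTorch sketch. All module names, sizes, and the gated fusion rule below are illustrative assumptions, not the actual TDFNet implementation.

```python
# Minimal, illustrative sketch of top-down audio-visual fusion in PyTorch.
# Module names, sizes, and the gated fusion rule are assumptions for clarity,
# not the actual TDFNet architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusionBlock(nn.Module):
    """Fuses a coarse (top) feature map into a finer (bottom) one."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, fine, coarse):
        # Upsample the coarse, semantically richer features to the fine
        # temporal resolution and use them to gate the fine features.
        coarse_up = F.interpolate(coarse, size=fine.shape[-1], mode="nearest")
        return fine * torch.sigmoid(self.gate(coarse_up))

class ToyAVSeparator(nn.Module):
    """Audio encoder + visual projection + top-down fusion + mask decoder."""
    def __init__(self, audio_ch=128, video_dim=512):
        super().__init__()
        self.audio_enc = nn.Conv1d(1, audio_ch, kernel_size=16, stride=8)
        self.video_proj = nn.Conv1d(video_dim, audio_ch, kernel_size=1)
        self.downsample = nn.Conv1d(audio_ch, audio_ch, kernel_size=4, stride=4)
        self.fuse_av = TopDownFusionBlock(audio_ch)   # visual -> coarse audio
        self.fuse_td = TopDownFusionBlock(audio_ch)   # coarse audio -> fine audio
        self.mask = nn.Conv1d(audio_ch, audio_ch, kernel_size=1)
        self.decoder = nn.ConvTranspose1d(audio_ch, 1, kernel_size=16, stride=8)

    def forward(self, mixture, lip_feats):
        # mixture: (B, 1, samples); lip_feats: (B, video_dim, video_frames)
        fine = self.audio_enc(mixture)        # fine-scale audio features
        coarse = self.downsample(fine)        # coarse-scale audio features
        visual = self.video_proj(lip_feats)   # project lip features
        coarse = self.fuse_av(coarse, visual) # inject visual cues at the top
        fused = self.fuse_td(fine, coarse)    # top-down refinement
        est_mask = torch.relu(self.mask(fused))
        return self.decoder(fine * est_mask)  # separated waveform estimate

mix = torch.randn(2, 1, 16000)   # 1 s of 16 kHz audio
lips = torch.randn(2, 512, 25)   # 25 video frames of lip embeddings
print(ToyAVSeparator()(mix, lips).shape)  # torch.Size([2, 1, 16000])
```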
Related papers
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
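As a rough illustration of that time-frequency interface, the sketch below applies a placeholder mask to the complex STFT bins and inverts the result; it assumes a mask-based separation step and is not RTFS-Net's recurrent architecture.

```python
# Illustrative time-frequency (T-F) domain separation interface, assuming
# mask-based separation on complex STFT bins; the "separator" here is a
# placeholder, not RTFS-Net's actual recurrent T-F network.
import torch

def tf_domain_separate(mixture, separator, n_fft=512, hop=128):
    # mixture: (batch, samples) -> complex spectrogram (batch, freq, frames)
    window = torch.hann_window(n_fft, device=mixture.device)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    # The separator predicts a complex mask over the T-F representation.
    mask = separator(spec)               # same shape as spec
    est_spec = spec * mask
    # Inverse STFT back to the waveform domain.
    return torch.istft(est_spec, n_fft, hop_length=hop, window=window,
                       length=mixture.shape[-1])

# Toy usage with an identity "separator".
wav = torch.randn(2, 16000)
out = tf_domain_separate(wav, separator=lambda spec: torch.ones_like(spec))
print(out.shape)  # torch.Size([2, 16000])
```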
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
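A minimal sketch of one such cross-modal attention layer, in which audio features attend to visual features, is given below; the dimensions and residual layout are assumptions, not the CMFE design from the paper.

```python
# Minimal cross-modal attention fusion layer: audio features attend to visual
# features. Dimensions and the residual/normalization layout are illustrative
# assumptions, not the paper's CMFE.
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio, visual):
        # audio: (B, T_audio, d), visual: (B, T_video, d)
        # Audio acts as the query; visual features are keys/values.
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)   # residual connection

audio = torch.randn(2, 200, 256)   # e.g. 200 acoustic frames
visual = torch.randn(2, 50, 256)   # e.g. 50 lip-movement frames
print(CrossModalFusionLayer()(audio, visual).shape)  # (2, 200, 256)
```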
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
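The mask-based MVDR front-end mentioned above typically follows the textbook recipe: time-frequency masks yield speech and noise spatial covariance matrices, from which beamforming weights are derived. Below is a NumPy sketch of that standard formulation, with random masks standing in for the paper's DNN mask estimators.

```python
# Textbook mask-based MVDR beamformer (Souden formulation) in NumPy.
# The neural networks that estimate the masks are omitted; random masks
# stand in for them here.
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, ref_channel=0):
    # Y: complex multichannel STFT, shape (C, F, T); masks: real, shape (F, T)
    C, F, T = Y.shape
    enhanced = np.zeros((F, T), dtype=complex)
    u = np.zeros(C)
    u[ref_channel] = 1.0                            # reference-channel selector
    for f in range(F):
        Yf = Y[:, f, :]                             # (C, T)
        # Mask-weighted spatial covariance matrices for speech and noise.
        phi_s = (speech_mask[f] * Yf) @ Yf.conj().T / max(speech_mask[f].sum(), 1e-6)
        phi_n = (noise_mask[f] * Yf) @ Yf.conj().T / max(noise_mask[f].sum(), 1e-6)
        phi_n += 1e-6 * np.eye(C)                   # diagonal loading for stability
        numerator = np.linalg.solve(phi_n, phi_s)   # phi_n^{-1} phi_s
        w = (numerator @ u) / (np.trace(numerator) + 1e-6)  # MVDR weights (C,)
        enhanced[f] = w.conj() @ Yf                 # beamformed output for bin f
    return enhanced                                 # (F, T) single-channel STFT

C, F, T = 4, 129, 100
Y = np.random.randn(C, F, T) + 1j * np.random.randn(C, F, T)
m = np.random.rand(F, T)
print(mask_based_mvdr(Y, m, 1.0 - m).shape)         # (129, 100)
```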
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonized and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
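A hedged sketch of such a co-prediction objective is given below: two representations of the same sound source are each trained to predict the other. The predictor and stop-gradient layout are assumptions, not the exact AVPC training recipe.

```python
# Sketch of a "co-prediction" self-supervised objective: two audio-visual
# representations of the same source predict each other. The predictors and
# the stop-gradient layout are assumptions, not AVPC's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_prediction_loss(z1, z2, predictor1, predictor2):
    # z1, z2: (B, D) representations of the same sound source from two views.
    p1 = predictor1(z1)   # predict z2 from z1
    p2 = predictor2(z2)   # predict z1 from z2
    loss12 = 1 - F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
    loss21 = 1 - F.cosine_similarity(p2, z1.detach(), dim=-1).mean()
    return 0.5 * (loss12 + loss21)

pred1 = nn.Linear(128, 128)
pred2 = nn.Linear(128, 128)
z_a = torch.randn(8, 128)
z_b = torch.randn(8, 128)
print(co_prediction_loss(z_a, z_b, pred1, pred2).item())
```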
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Audio-Visual Speech Separation in Noisy Environments with a Lightweight Iterative Model [35.171785986428425]
We propose Audio-Visual Lightweight ITerative model (AVLIT) to perform audio-visual speech separation in noisy environments.
Our architecture consists of an audio branch and a video branch, with iterative A-FRCNN blocks sharing weights for each modality.
Experiments demonstrate the superiority of our model in both settings with respect to various audio-only and audio-visual baselines.
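The weight-sharing idea behind the iterative blocks can be sketched as a single block reused across refinement iterations; the block contents below are placeholders, not AVLIT's actual A-FRCNN architecture.

```python
# Minimal sketch of an iterative, weight-shared refinement branch: the same
# block is applied several times, so its parameters are shared across
# iterations. The block itself is a placeholder, not an A-FRCNN.
import torch
import torch.nn as nn

class IterativeBranch(nn.Module):
    def __init__(self, channels=64, n_iters=4):
        super().__init__()
        # One block, reused at every iteration -> shared parameters.
        self.block = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.PReLU(),
        )
        self.n_iters = n_iters

    def forward(self, x):
        for _ in range(self.n_iters):
            x = x + self.block(x)   # iterative refinement with a residual
        return x

audio_branch = IterativeBranch()
video_branch = IterativeBranch()
a = audio_branch(torch.randn(2, 64, 500))
v = video_branch(torch.randn(2, 64, 50))
print(a.shape, v.shape)
```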
arXiv Detail & Related papers (2023-05-31T20:09:50Z)
- Exploring Turkish Speech Recognition via Hybrid CTC/Attention Architecture and Multi-feature Fusion Network [1.514049362441354]
This paper studies a series of tuning techniques for speech recognition.
To maximize the use of effective feature information, we propose a new feature extractor, LSPC.
Our model achieves performance comparable to that of advanced end-to-end models.
arXiv Detail & Related papers (2023-03-22T04:11:35Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
This paper proposes the Intra-SE-Conformer and Inter-Transformer (ISCIT) framework for speech separation.
The new SE-Conformer network can model audio sequences across multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can achieve a reduction in more than half of the number of floating point operations for off-the-shelf audio neural networks.
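A minimal sketch of such a non-parametric pooling front-end is shown below, assuming average pooling over the time axis of a log-mel spectrogram; the pooling factor and feature shapes are illustrative.

```python
# Illustrative non-parametric pooling front-end: average-pool the time axis of
# the input features before the classifier to remove temporal redundancy and
# cut downstream FLOPs. Pooling factor and shapes are assumptions.
import torch
import torch.nn.functional as F

def simple_pooling_frontend(features, pool_factor=2):
    # features: (batch, n_mels, frames) log-mel spectrogram
    # Halving the frame rate roughly halves the downstream network's FLOPs.
    return F.avg_pool1d(features, kernel_size=pool_factor, stride=pool_factor)

mel = torch.randn(4, 64, 1000)                 # 1000 frames
print(simple_pooling_frontend(mel).shape)      # torch.Size([4, 64, 500])
```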
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- A Study of Designing Compact Audio-Visual Wake Word Spotting System Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
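Below is a minimal sketch of iterative magnitude pruning with fine-tuning between rounds, in the spirit of LTH-IF, using PyTorch's pruning utilities; the model, sparsity schedule, and the placeholder fine-tuning loop are assumptions, not the paper's setup.

```python
# Sketch of iterative magnitude pruning with fine-tuning between rounds,
# using torch.nn.utils.prune. The toy model, sparsity schedule, and the
# random-data fine-tuning loop are placeholders.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))

def fine_tune(model, steps=10):
    # Placeholder fine-tuning on random data; a real system would use the
    # audio-visual wake-word training set here.
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        x = torch.randn(32, 64)
        y = torch.randint(0, 2, (32,))
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

for round_idx in range(3):                      # several pruning rounds
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Remove 20% of the remaining weights by L1 magnitude.
            prune.l1_unstructured(module, name="weight", amount=0.2)
    fine_tune(model)                            # recover accuracy after pruning

# Make the pruning masks permanent.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.remove(module, "weight")
```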
arXiv Detail & Related papers (2022-02-17T08:26:25Z)