Dual-branch Attention-In-Attention Transformer for single-channel speech enhancement
- URL: http://arxiv.org/abs/2110.06467v1
- Date: Wed, 13 Oct 2021 03:03:49 GMT
- Title: Dual-branch Attention-In-Attention Transformer for single-channel speech enhancement
- Authors: Guochen Yu, Andong Li, Yutian Wang, Yinuo Guo, Hui Wang, Chengshi
Zheng
- Abstract summary: We propose a dual-branch attention-in-attention transformer dubbed DB-AIAT to handle both coarse- and fine-grained regions of the spectrum in parallel.
Within each branch, we propose a novel attention-in-attention transformer-based module to replace the conventional RNNs and temporal convolutional networks for temporal sequence modeling.
- Score: 6.894606865794746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Curriculum learning has begun to thrive in the speech enhancement
area, where the original spectrum estimation task is decoupled into multiple
easier sub-tasks to achieve better performance. Motivated by this, we propose a dual-branch
attention-in-attention transformer dubbed DB-AIAT to handle both coarse- and
fine-grained regions of the spectrum in parallel. From a complementary
perspective, a magnitude masking branch is proposed to coarsely estimate the
overall magnitude spectrum, and simultaneously a complex refining branch is
elaborately designed to compensate for the missing spectral details and
implicitly derive phase information. Within each branch, we propose a novel
attention-in-attention transformer-based module to replace the conventional
RNNs and temporal convolutional networks for temporal sequence modeling.
Specifically, the proposed attention-in-attention transformer consists of
adaptive temporal-frequency attention transformer blocks and an adaptive
hierarchical attention module, aiming to capture long-term temporal-frequency
dependencies and further aggregate global hierarchical contextual information.
Experimental results on Voice Bank + DEMAND demonstrate that DB-AIAT yields
state-of-the-art performance (e.g., 3.31 PESQ, 94.7% STOI and 10.79 dB SSNR)
over previous advanced systems with a relatively small model size (2.81 M parameters).
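The abstract's dual-branch design can be illustrated with a minimal sketch of how the two branch outputs might be fused. This is a hypothetical reconstruction assuming the magnitude branch predicts a real-valued mask applied to the noisy spectrum (reusing the noisy phase) and the complex branch predicts a residual correction; the branch internals (the attention-in-attention transformers) are omitted, and the function name and shapes are illustrative, not the paper's API.

```python
import numpy as np

def fuse_branches(noisy_spec, mag_mask, complex_residual):
    """Combine a coarse magnitude estimate with a fine complex refinement.

    noisy_spec       : complex STFT of the noisy speech, shape (F, T)
    mag_mask         : real-valued mask, shape (F, T)
    complex_residual : complex-valued correction, shape (F, T)
    """
    # Coarse estimate: masked magnitude, noisy phase kept implicitly
    # (scaling a complex number by a real mask preserves its phase).
    coarse = mag_mask * noisy_spec
    # Fine estimate: the complex residual restores missing spectral
    # details and implicitly adjusts the phase.
    return coarse + complex_residual
```

In this reading, the magnitude branch alone would leave the noisy phase untouched; adding the complex residual is what allows the phase to deviate from the noisy input.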
Related papers
- Electromyography-Based Gesture Recognition: Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics [0.7083699704958353]
We propose a lightweight squeeze-and-excitation deep learning approach with multi-stream extraction of time-varying spatial-temporal features.
The proposed model was tested on the Ninapro DB2, DB4, and DB5 datasets, achieving accuracy rates of 96.41%, 92.40%, and 93.34%, respectively.
arXiv Detail & Related papers (2025-04-04T07:11:12Z) - A Hybrid Transformer-Mamba Network for Single Image Deraining [70.64069487982916]
Existing deraining Transformers employ self-attention mechanisms with fixed-range windows or along channel dimensions.
We introduce a novel dual-branch hybrid Transformer-Mamba network, denoted as TransMamba, aimed at effectively capturing long-range rain-related dependencies.
arXiv Detail & Related papers (2024-08-31T10:03:19Z) - Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately.
Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail.
Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z) - Frequency-Adaptive Dilated Convolution for Semantic Segmentation [14.066404173580864]
We propose three strategies to improve individual phases of dilated convolution from the view of spectrum analysis.
We introduce Frequency-Adaptive Dilated Convolution (FADC), which adjusts dilation rates spatially based on local frequency components.
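The idea of adjusting dilation rates spatially from local frequency content can be sketched as a toy example. The high-frequency estimator below (mean absolute Laplacian response over a window) and the threshold-to-rate mapping are assumptions for illustration only, not FADC's actual method.

```python
import numpy as np

def local_high_freq(img, win=3):
    """Mean |Laplacian| over a win x win neighborhood (wrap-around edges)."""
    lap = (np.roll(img, 1, 0) + np.roll(img, -1, 0)
           + np.roll(img, 1, 1) + np.roll(img, -1, 1) - 4 * img)
    pad = win // 2
    padded = np.pad(np.abs(lap), pad)
    out = np.zeros_like(img, dtype=float)
    for dy in range(win):          # box-filter the Laplacian magnitude
        for dx in range(win):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (win * win)

def adaptive_dilation(img, rates=(1, 2, 4), thresh=(0.5, 0.1)):
    """Busy (high-frequency) regions get small dilation, smooth regions large."""
    hf = local_high_freq(img)
    d = np.full(img.shape, rates[2])   # smooth regions: largest dilation
    d[hf > thresh[1]] = rates[1]
    d[hf > thresh[0]] = rates[0]       # busy regions: smallest dilation
    return d
```

The intuition: a large dilation rate under-samples high-frequency content (aliasing), so regions with strong local variation are assigned denser sampling, while flat regions can afford a wider receptive field.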
We design two plug-in modules to directly enhance effective bandwidth and receptive field size.
arXiv Detail & Related papers (2024-03-08T15:00:44Z) - Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection [41.38587746899477]
Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attention in the remote sensing community.
Existing SAR change detection methods are mainly based on convolutional neural networks (CNNs).
We propose a convolution and attention mixer (CAMixer) to incorporate global attention.
arXiv Detail & Related papers (2023-09-21T12:28:23Z) - Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks [4.132793413136553]
We introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism.
The proposed design captures the variable length feature of speech and addresses the limitations of fixed-length attention.
arXiv Detail & Related papers (2023-09-14T14:51:51Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three aspects of merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strength of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z) - CMGAN: Conformer-based Metric GAN for Speech Enhancement [6.480967714783858]
We propose a conformer-based metric generative adversarial network (CMGAN) for time-frequency domain.
In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information.
The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech.
arXiv Detail & Related papers (2022-03-28T23:53:34Z) - Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose to Adaptively learn Frequency information in a two-branch Detection framework, dubbed AFD.
We liberate our network from the fixed frequency transforms, and achieve better performance with our data- and task-dependent transform layers.
arXiv Detail & Related papers (2022-03-27T14:25:52Z) - TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.