DPATD: Dual-Phase Audio Transformer for Denoising
- URL: http://arxiv.org/abs/2310.19588v1
- Date: Mon, 30 Oct 2023 14:44:59 GMT
- Title: DPATD: Dual-Phase Audio Transformer for Denoising
- Authors: Junhui Li, Pu Wang, Jialu Li, Xinzhe Wang, Youshan Zhang
- Abstract summary: We propose a dual-phase audio transformer for denoising (DPATD), a novel model that organizes transformer layers in a deep structure to learn clean audio sequences for denoising.
Our memory-compressed explainable attention is efficient and converges faster than the widely used self-attention module.
- Score: 25.097894984130733
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recent high-performance transformer-based speech enhancement models
demonstrate that time-domain methods can achieve performance similar to that of
time-frequency domain methods. However, time-domain speech enhancement systems
typically receive input audio sequences consisting of a large number of time
steps, making it challenging to model extremely long sequences and to train
models that perform adequately. In this paper, we address these challenges by
using smaller audio chunks as input, making efficient use of the audio
information. We propose a dual-phase audio transformer for denoising (DPATD), a
novel model that organizes transformer layers in a deep structure to learn
clean audio sequences for denoising. DPATD splits the audio input into smaller
chunks, where the input length can be proportional to the square root of the
original sequence length. Our memory-compressed explainable attention is
efficient and converges faster than the widely used self-attention module.
Extensive experiments demonstrate that our model outperforms state-of-the-art
methods.
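The two ideas in the abstract lend themselves to a short illustration. Below is a minimal PyTorch sketch, not the authors' implementation: `chunk_audio` shows how splitting a length-L waveform into chunks of size ~sqrt(L) keeps each attention window small, and `MemoryCompressedAttention` is an assumed stand-in that shortens keys and values with a strided convolution, in the spirit of memory-compressed attention (Liu et al., 2018); the paper's explainable variant may differ. All names and the compression factor `c` are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def chunk_audio(x: torch.Tensor) -> torch.Tensor:
    """Split a waveform (batch, length) into chunks of size ~sqrt(length),
    so each layer attends over O(sqrt(L)) steps instead of L."""
    batch, length = x.shape
    chunk = math.ceil(math.sqrt(length))          # chunk size ~ sqrt(L)
    n_chunks = math.ceil(length / chunk)
    x = F.pad(x, (0, n_chunks * chunk - length))  # zero-pad to a multiple
    return x.view(batch, n_chunks, chunk)

class MemoryCompressedAttention(nn.Module):
    """Self-attention whose keys/values are shortened in time by a strided
    1-D convolution (factor c), following Liu et al. (2018); an assumed
    stand-in for the paper's memory-compressed explainable attention."""
    def __init__(self, dim: int, n_heads: int = 4, c: int = 4):
        super().__init__()
        self.compress = nn.Conv1d(dim, dim, kernel_size=c, stride=c)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); only keys/values are compressed in time,
        # reducing attention cost from O(T^2) to O(T * T/c).
        kv = self.compress(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.attn(x, kv, kv)             # queries keep full length
        return out

# A 1 s clip at 8 kHz (L = 8000) becomes 89 chunks of 90 samples each.
wav = torch.randn(2, 8000)
print(chunk_audio(wav).shape)                     # torch.Size([2, 89, 90])

# Attention over 500 encoded frames with 64-dim features.
mca = MemoryCompressedAttention(dim=64)
print(mca(torch.randn(2, 500, 64)).shape)         # torch.Size([2, 500, 64])
```

With sqrt(L)-sized chunks, each self-attention runs over O(sqrt(L)) positions, so its quadratic cost is paid on sqrt(L)-length sequences rather than on the full L-length input.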
Related papers
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments [20.592466025674643]
Time domain conformers (TD-Conformers) are an analogue of the dual-path (DP) approach in that they also process local and global context sequentially.
The best TD-Conformer achieves 14.6 dB and 21.2 dB SISDR improvements on the WHAMR and WSJ0-2Mix benchmarks, respectively.
arXiv Detail & Related papers (2023-10-09T20:02:11Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms to the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models to match the performance of larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z)
- Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation [72.7915031238824]
Large diffusion models have been successful in text-to-audio (T2A) synthesis tasks.
However, they often suffer from common issues such as semantic misalignment and poor temporal consistency.
We propose Make-an-Audio 2, a latent diffusion-based T2A method that builds on the success of Make-an-Audio.
arXiv Detail & Related papers (2023-05-29T10:41:28Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net designed for a joint denoising process.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST, with an average accuracy gain of 3.4% across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Audiomer: A Convolutional Transformer for Keyword Spotting [0.0]
We introduce Audiomer, where we combine 1D Residual Networks with Performer Attention to achieve state-of-the-art performance in Keyword Spotting.
Audiomer allows for deployment in compute-constrained devices and training on smaller datasets.
arXiv Detail & Related papers (2021-09-21T15:28:41Z)
- Voice2Series: Reprogramming Acoustic Models for Time Series Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%.
arXiv Detail & Related papers (2021-06-17T07:59:15Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.