A non-causal FFTNet architecture for speech enhancement
- URL: http://arxiv.org/abs/2006.04469v1
- Date: Mon, 8 Jun 2020 10:49:04 GMT
- Title: A non-causal FFTNet architecture for speech enhancement
- Authors: Muhammed PV Shifas, Nagaraj Adiga, Vassilis Tsiaras, Yannis Stylianou
- Abstract summary: We suggest a new parallel, non-causal and shallow waveform domain architecture for speech enhancement based on FFTNet.
By using a shallow network and applying non-causality within certain limits, the proposed FFTNet uses far fewer parameters than other neural-network-based approaches.
- Score: 18.583426581177278
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we suggest a new parallel, non-causal and shallow waveform
domain architecture for speech enhancement based on FFTNet, a neural network
for generating high quality audio waveform. In contrast to other waveform based
approaches like WaveNet, FFTNet uses an initial wide dilation pattern. Such an
architecture better represents the long term correlated structure of speech in
the time domain, where noise is usually highly non-correlated, and therefore it
is suitable for waveform domain based speech enhancement. To further strengthen
this feature of FFTNet, we suggest a non-causal FFTNet architecture, where the
present sample in each layer is estimated from the past and future samples of
the previous layer. By using a shallow network and applying non-causality
within certain limits, the proposed FFTNet for speech enhancement (SE-FFTNet)
uses far fewer parameters than other neural-network-based approaches
for speech enhancement like WaveNet and SEGAN. Specifically, the suggested
network has considerably reduced model parameters: 32% fewer compared to
WaveNet and 87% fewer compared to SEGAN. Finally, based on subjective and
objective metrics, SE-FFTNet outperforms WaveNet in terms of enhanced signal
quality, while performing on par with SEGAN. A TensorFlow implementation of
the architecture is provided at [1].
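The non-causal layer described above (each output sample estimated from a past and a future sample of the previous layer, with an initial wide dilation that narrows in deeper layers) can be sketched roughly as follows. The weights, dilation schedule, and ReLU nonlinearity here are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def noncausal_fftnet_layer(x, w_past, w_future, dilation):
    """One FFTNet-style non-causal layer (illustrative sketch): each output
    sample combines one past and one future input sample, `dilation` steps
    away, followed by a ReLU."""
    # zero-pad so the output keeps the input length
    past = np.pad(x, (dilation, 0))[: len(x)]     # x[t - dilation]
    future = np.pad(x, (0, dilation))[dilation:]  # x[t + dilation]
    return np.maximum(w_past * past + w_future * future, 0.0)

# a shallow stack with a wide-to-narrow dilation pattern, as in FFTNet
x = np.random.randn(1024).astype(np.float32)
h = x
for d in (512, 256, 128, 64):  # wide initial dilation, then narrower
    h = noncausal_fftnet_layer(h, 0.5, 0.5, d)
```

Because the receptive field doubles with each dilation, a few such layers already cover long-range correlations in the waveform, which is what lets the network stay shallow.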
Related papers
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
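As a rough illustration of that step, a CWT maps a 1D signal onto a scales-by-time 2D tensor. This minimal Morlet-wavelet sketch (the wavelet choice and scale grid are assumptions, not taken from the paper) shows the shape transformation:

```python
import numpy as np

def morlet_cwt(signal, scales, w0=6.0):
    """Minimal continuous wavelet transform with a Morlet mother wavelet,
    returning a (scales x time) 2D tensor. Illustrative only."""
    n = len(signal)
    out = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        t = np.arange(-n // 2, n // 2) / s
        wavelet = np.exp(1j * w0 * t) * np.exp(-t**2 / 2) / np.sqrt(s)
        # correlate the signal with the scaled wavelet
        out[i] = np.convolve(signal, np.conj(wavelet)[::-1], mode="same")
    return out

sig = np.sin(2 * np.pi * 5 * np.arange(0, 1, 1 / 256))  # 5 Hz tone, 256 samples
tensor = morlet_cwt(sig, scales=np.arange(1, 33))        # shape (32, 256)
```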
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform [21.896817015593122]
We introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain.
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN.
Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications.
arXiv Detail & Related papers (2023-09-18T05:30:15Z)
- FFC-SE: Fast Fourier Convolution for Speech Enhancement [1.0499611180329804]
Fast Fourier convolution (FFC) is a recently proposed neural operator that has shown promising performance in several computer vision problems.
In this work, we design neural network architectures which adapt FFC for speech enhancement.
We found that neural networks based on FFC outperform analogous convolutional models and show better or comparable results with other speech enhancement baselines.
arXiv Detail & Related papers (2022-04-06T18:52:47Z)
- Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rate (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
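Relative WER reduction, as quoted above, is the fraction of the baseline system's word errors that the new system removes. A small sketch with hypothetical numbers (the abstract does not state the baseline WERs):

```python
def relative_wer_reduction(wer_baseline, wer_new):
    """Fraction of the baseline's word errors eliminated by the new system."""
    return (wer_baseline - wer_new) / wer_baseline

# hypothetical example: dropping from 5.00% to 3.90% WER
# removes 22% of the baseline's errors
print(round(relative_wer_reduction(5.00, 3.90), 2))  # 0.22
```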
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially on Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) Soft Actor-Critic for discrete (SAC-d) approach, which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Towards Theoretical Understanding of Flexible Transmitter Networks via Approximation and Local Minima [74.30120779041428]
We study the theoretical properties of one-hidden-layer FTNet from the perspectives of approximation and local minima.
Our results indicate that FTNet can efficiently express target functions and has no concern about local minima.
arXiv Detail & Related papers (2021-11-11T02:41:23Z)
- Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN).
TSEGAN is an extension of the generative adversarial network (GAN) in time-domain with metric evaluation to mitigate the scaling problem.
In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z)
- Network Adjustment: Channel Search Guided by FLOPs Utilization Ratio [101.84651388520584]
This paper presents a new framework named network adjustment, which considers network accuracy as a function of FLOPs.
Experiments on standard image classification datasets and a wide range of base networks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2020-04-06T15:51:00Z)
- FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA [27.50143717931293]
WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution.
We develop the first accelerator platform, FastWave, for autoregressive convolutional neural networks.
arXiv Detail & Related papers (2020-02-09T06:15:09Z)
- WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss [74.11899135025503]
Tacotron-based text-to-speech (TTS) systems directly synthesize speech from text input.
We propose a new training scheme for Tacotron-based TTS, referred to as WaveTTS, that has 2 loss functions.
WaveTTS ensures both the quality of the acoustic features and the resulting speech waveform.
arXiv Detail & Related papers (2020-02-02T15:51:22Z)
- Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps a noisy waveform to a clean waveform.
We show that our model improves enhancement performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.