DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices
for Full-Band Audio
- URL: http://arxiv.org/abs/2205.05474v1
- Date: Wed, 11 May 2022 13:19:41 GMT
- Title: DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices
for Full-Band Audio
- Authors: Hendrik Schröter, Alberto N. Escalante-B., Tobias Rosenkranz,
Andreas Maier
- Abstract summary: DeepFilterNet exploits the harmonic structure of speech, allowing for efficient speech enhancement (SE).
Several optimizations in the training procedure, data augmentation, and network structure result in state-of-the-art SE performance.
This makes the algorithm suitable for running on embedded devices in real time.
- Score: 10.662665274373387
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep learning-based speech enhancement has seen huge improvements and
has recently expanded to full-band audio (48 kHz). However, many approaches
have a rather high computational complexity and require large temporal buffers
for real-time usage, e.g. due to temporal convolutions or attention. Both make
these approaches infeasible on embedded devices. This work further extends
DeepFilterNet, which exploits the harmonic structure of speech to enable
efficient speech enhancement (SE). Several optimizations in the training
procedure, data augmentation, and network structure result in state-of-the-art
SE performance while reducing the real-time factor to 0.04 on a notebook
Core-i5 CPU. This makes the algorithm suitable for running on embedded devices
in real time. The DeepFilterNet framework is available under an open-source
license.
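The real-time factor (RTF) quoted above is simply processing time divided by audio duration; values below 1.0 mean faster than real time. A minimal sketch of the metric (the timing numbers below are illustrative, chosen only to match the reported 0.04, and the function name is ours, not the paper's):

```python
def real_time_factor(process_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed.

    Values below 1.0 mean the system runs faster than real time.
    """
    return process_seconds / audio_seconds

# Example: enhancing a 10 s clip in 0.4 s of CPU time yields RTF of about
# 0.04, the figure reported in the abstract.
rtf = real_time_factor(0.4, 10.0)
```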
Related papers
- DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement [10.662665274373387]
We present a real-time speech enhancement demo using DeepFilterNet.
Our model matches state-of-the-art speech enhancement benchmarks while achieving a real-time factor of 0.19 on a single-threaded notebook CPU.
arXiv Detail & Related papers (2023-05-14T19:09:35Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating-point operations of off-the-shelf audio neural networks by more than half.
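A non-parametric pooling front-end of this kind can be pictured as plain average pooling along the time axis of the input features; the names and the pooling factor here are illustrative, not the paper's exact configuration:

```python
import numpy as np

def avg_pool_time(features: np.ndarray, factor: int = 2) -> np.ndarray:
    """Average-pool a (time, freq) feature map along the time axis.

    Halving the number of frames (factor=2) roughly halves the FLOPs of
    any downstream network whose cost is linear in sequence length.
    """
    t, f = features.shape
    t_trim = (t // factor) * factor  # drop frames that don't fill a window
    return features[:t_trim].reshape(t_trim // factor, factor, f).mean(axis=1)

spec = np.random.rand(100, 64)      # toy input: 100 frames, 64 mel bins
pooled = avg_pool_time(spec)        # -> shape (50, 64)
```

There are no learned parameters here, which is the point: the reduction in compute comes for free before the network runs.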
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- End-to-End Neural Audio Coding for Real-Time Communications [22.699018098484707]
This paper proposes TFNet, an end-to-end neural audio system with low latency for real-time communications (RTC).
An interleaved structure is proposed for temporal filtering to capture both short-term and long-term temporal dependencies.
With end-to-end optimization, the TFNet is jointly optimized with speech enhancement and packet loss concealment, yielding a one-for-all network for three tasks.
arXiv Detail & Related papers (2022-01-24T03:06:30Z)
- Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for
5G and Beyond [70.81551587109833]
Nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity.
One of the main challenges comes from the real-time implementation of these algorithms.
This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- DeepFilterNet: A Low Complexity Speech Enhancement Framework for
Full-Band Audio based on Deep Filtering [9.200520879361916]
We propose DeepFilterNet, a two stage speech enhancement framework utilizing deep filtering.
First, we enhance the spectral envelope using ERB-scaled gains modeling human frequency perception.
The second stage employs deep filtering to enhance the periodic components of speech.
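The first stage can be pictured as applying one gain per perceptual (ERB-scaled) band to the magnitude spectrogram. The band layout and gains below are invented for illustration and do not reproduce DeepFilterNet's actual filterbank; in the real model, the gains are predicted per frame by the network:

```python
import numpy as np

def apply_band_gains(mag: np.ndarray, band_edges: list,
                     gains: np.ndarray) -> np.ndarray:
    """Scale each frequency band of a (freq, time) magnitude spectrogram.

    band_edges: bin indices delimiting the bands, e.g. [0, 4, 10, 16]
    gains: one gain in [0, 1] per band (len(band_edges) - 1 values)
    """
    out = mag.copy()
    for i, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        out[lo:hi] *= gains[i]  # attenuate all bins in this band equally
    return out

mag = np.ones((16, 5))  # toy magnitude spectrogram: 16 bins, 5 frames
enhanced = apply_band_gains(mag, [0, 4, 10, 16], np.array([0.2, 1.0, 0.5]))
```

Because a coarse band gain cannot separate closely spaced harmonics from noise, the second stage's deep filtering refines the periodic components on top of this envelope.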
arXiv Detail & Related papers (2021-10-11T20:03:52Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
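The core idea of such an encoding can be sketched as writing each quantized weight as a power-of-two weighted sum of {-1, +1} values, so one quantized layer becomes several binary branches. This is a simplified illustration of the decomposition, not the paper's exact algorithm:

```python
import numpy as np

def decompose(q: np.ndarray, bits: int) -> list:
    """Decompose odd-valued quantized weights q, with entries in
    [-(2**bits - 1), 2**bits - 1], into `bits` matrices B_i whose entries
    are in {-1, +1}, such that q = sum_i 2**i * B_i."""
    branches = []
    rem = q.astype(np.int64).copy()
    for i in reversed(range(bits)):          # peel off largest weight first
        b = np.where(rem > 0, 1, -1)
        rem = rem - (2 ** i) * b
        branches.append(b)
    return list(reversed(branches))          # B_0, B_1, ... (weight 2**i)

q = np.array([[3, -1], [1, -3]])             # 2-bit quantized weights (odd levels)
B = decompose(q, 2)
recon = sum((2 ** i) * B[i] for i in range(2))  # equals q
```

Since matrix multiplication distributes over the sum, `X @ W` becomes `sum_i 2**i * (X @ B_i)`, i.e. a few cheap binary matmuls plus shifts, which is what enables the acceleration.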
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- LightSpeech: Lightweight and Fast Text to Speech with Neural
Architecture Search [127.56834100382878]
We propose LightSpeech to automatically design more lightweight and efficient TTS models based on FastSpeech.
Experiments show that the model discovered by our method achieves 15x model compression ratio and 6.5x inference speedup on CPU with on par voice quality.
arXiv Detail & Related papers (2021-02-08T07:45:06Z)
- A.I. based Embedded Speech to Text Using Deepspeech [3.2221306786493065]
This paper shows the implementation process of speech recognition on a low-end computational device.
Deepspeech is an open-source voice recognition engine that uses a neural network to convert a speech spectrogram into a text transcript.
This paper reports experiments with Deepspeech versions 0.1.0, 0.1.1, and 0.6.0; version 0.6.0 shows some improvement, processing speech-to-text faster on old hardware such as the Raspberry Pi 3 B+.
arXiv Detail & Related papers (2020-02-25T08:27:41Z)
- Towards High Performance Java-based Deep Learning Frameworks [0.22940141855172028]
Modern cloud services have set the demand for fast and efficient data processing.
This demand is common among numerous application domains, such as deep learning, data mining, and computer vision.
In this paper, we have employed TornadoVM, a state-of-the-art programming framework, to transparently accelerate Deep Netts, a Java-based deep learning framework.
arXiv Detail & Related papers (2020-01-13T13:03:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.