DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices
for Full-Band Audio
- URL: http://arxiv.org/abs/2205.05474v1
- Date: Wed, 11 May 2022 13:19:41 GMT
- Title: DeepFilterNet2: Towards Real-Time Speech Enhancement on Embedded Devices
for Full-Band Audio
- Authors: Hendrik Schröter, Alberto N. Escalante-B., Tobias Rosenkranz,
Andreas Maier
- Abstract summary: DeepFilterNet exploits the harmonic structure of speech, allowing for efficient speech enhancement (SE).
Several optimizations in the training procedure, data augmentation, and network structure result in state-of-the-art SE performance.
This makes the algorithm suitable for running on embedded devices in real time.
- Score: 10.662665274373387
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep learning-based speech enhancement has seen huge improvements and
has recently expanded to full-band audio (48 kHz). However, many approaches
have a rather high computational complexity and require large temporal buffers
for real-time usage, e.g. due to temporal convolutions or attention. Both make
these approaches infeasible on embedded devices. This work further extends
DeepFilterNet, which exploits the harmonic structure of speech to enable
efficient speech enhancement (SE). Several optimizations in the training
procedure, data augmentation, and network structure result in state-of-the-art
SE performance while reducing the real-time factor to 0.04 on a notebook
Core-i5 CPU. This makes the algorithm suitable for running on embedded devices
in real time. The DeepFilterNet framework is available under an open-source
license.
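The real-time factor (RTF) quoted above is simply processing time divided by audio duration; values below 1.0 mean faster than real time. A minimal sketch of the metric (the timing numbers below are illustrative, chosen only to match the reported 0.04, and the function name is ours, not the paper's):

```python
def real_time_factor(process_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed.

    Values below 1.0 mean the system runs faster than real time.
    """
    return process_seconds / audio_seconds

# Example: enhancing a 10 s clip in 0.4 s of CPU time yields RTF of about
# 0.04, the figure reported in the abstract.
rtf = real_time_factor(0.4, 10.0)
```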
Related papers
- DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement [10.662665274373387]
We present a real-time speech enhancement demo using DeepFilterNet.
Our model matches state-of-the-art speech enhancement benchmarks while achieving a real-time factor of 0.19 on a single-threaded notebook CPU.
arXiv Detail & Related papers (2023-05-14T19:09:35Z)
- Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification.
We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information.
SimPFs can reduce the number of floating-point operations of off-the-shelf audio neural networks by more than half.
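A non-parametric pooling front-end of this kind can be pictured as plain average pooling along the time axis of the input features; the names and the pooling factor here are illustrative, not the paper's exact configuration:

```python
import numpy as np

def avg_pool_time(features: np.ndarray, factor: int = 2) -> np.ndarray:
    """Average-pool a (time, freq) feature map along the time axis.

    Halving the number of frames (factor=2) roughly halves the FLOPs of
    any downstream network whose cost is linear in sequence length.
    """
    t, f = features.shape
    t_trim = (t // factor) * factor  # drop frames that don't fill a window
    return features[:t_trim].reshape(t_trim // factor, factor, f).mean(axis=1)

spec = np.random.rand(100, 64)      # toy input: 100 frames, 64 mel bins
pooled = avg_pool_time(spec)        # -> shape (50, 64)
```

There are no learned parameters here, which is the point: the reduction in compute comes for free before the network runs.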
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
- End-to-End Neural Audio Coding for Real-Time Communications [22.699018098484707]
This paper proposes TFNet, an end-to-end neural audio system with low latency for real-time communications (RTC).
An interleaved structure is proposed for temporal filtering to capture both short-term and long-term temporal dependencies.
With end-to-end optimization, the TFNet is jointly optimized with speech enhancement and packet loss concealment, yielding a one-for-all network for three tasks.
arXiv Detail & Related papers (2022-01-24T03:06:30Z)
- Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for
5G and Beyond [70.81551587109833]
Nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity.
One of the main challenges comes from the real-time implementation of these algorithms.
This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- DeepFilterNet: A Low Complexity Speech Enhancement Framework for
Full-Band Audio based on Deep Filtering [9.200520879361916]
We propose DeepFilterNet, a two stage speech enhancement framework utilizing deep filtering.
First, we enhance the spectral envelope using ERB-scaled gains modeling human frequency perception.
The second stage employs deep filtering to enhance the periodic components of speech.
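The first stage can be pictured as applying one gain per perceptual (ERB-scaled) band to the magnitude spectrogram. The band layout and gains below are invented for illustration and do not reproduce DeepFilterNet's actual filterbank; in the real model, the gains are predicted per frame by the network:

```python
import numpy as np

def apply_band_gains(mag: np.ndarray, band_edges: list,
                     gains: np.ndarray) -> np.ndarray:
    """Scale each frequency band of a (freq, time) magnitude spectrogram.

    band_edges: bin indices delimiting the bands, e.g. [0, 4, 10, 16]
    gains: one gain in [0, 1] per band (len(band_edges) - 1 values)
    """
    out = mag.copy()
    for i, (lo, hi) in enumerate(zip(band_edges[:-1], band_edges[1:])):
        out[lo:hi] *= gains[i]  # attenuate all bins in this band equally
    return out

mag = np.ones((16, 5))  # toy magnitude spectrogram: 16 bins, 5 frames
enhanced = apply_band_gains(mag, [0, 4, 10, 16], np.array([0.2, 1.0, 0.5]))
```

Because a coarse band gain cannot separate closely spaced harmonics from noise, the second stage's deep filtering refines the periodic components on top of this envelope.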
arXiv Detail & Related papers (2021-10-11T20:03:52Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and
Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
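The core idea of such an encoding can be sketched as writing each quantized weight as a power-of-two weighted sum of {-1, +1} values, so one quantized layer becomes several binary branches. This is a simplified illustration of the decomposition, not the paper's exact algorithm:

```python
import numpy as np

def decompose(q: np.ndarray, bits: int) -> list:
    """Decompose odd-valued quantized weights q, with entries in
    [-(2**bits - 1), 2**bits - 1], into `bits` matrices B_i whose entries
    are in {-1, +1}, such that q = sum_i 2**i * B_i."""
    branches = []
    rem = q.astype(np.int64).copy()
    for i in reversed(range(bits)):          # peel off largest weight first
        b = np.where(rem > 0, 1, -1)
        rem = rem - (2 ** i) * b
        branches.append(b)
    return list(reversed(branches))          # B_0, B_1, ... (weight 2**i)

q = np.array([[3, -1], [1, -3]])             # 2-bit quantized weights (odd levels)
B = decompose(q, 2)
recon = sum((2 ** i) * B[i] for i in range(2))  # equals q
```

Since matrix multiplication distributes over the sum, `X @ W` becomes `sum_i 2**i * (X @ B_i)`, i.e. a few cheap binary matmuls plus shifts, which is what enables the acceleration.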
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- LightSpeech: Lightweight and Fast Text to Speech with Neural
Architecture Search [127.56834100382878]
We propose LightSpeech to automatically design more lightweight and efficient TTS models based on FastSpeech.
Experiments show that the model discovered by our method achieves 15x model compression ratio and 6.5x inference speedup on CPU with on par voice quality.
arXiv Detail & Related papers (2021-02-08T07:45:06Z)
- A.I. based Embedded Speech to Text Using Deepspeech [3.2221306786493065]
This paper shows the implementation process of speech recognition on a low-end computational device.
Deepspeech is an open-source voice recognition engine that uses a neural network to convert a speech spectrogram into a text transcript.
This paper reports experiments with Deepspeech versions 0.1.0, 0.1.1, and 0.6.0; version 0.6.0 shows some improvement, processing speech-to-text faster on old hardware such as the Raspberry Pi 3 B+.
arXiv Detail & Related papers (2020-02-25T08:27:41Z)
- Towards High Performance Java-based Deep Learning Frameworks [0.22940141855172028]
Modern cloud services have set the demand for fast and efficient data processing.
This demand is common among numerous application domains, such as deep learning, data mining, and computer vision.
In this paper, we have employed TornadoVM, a state-of-the-art programming framework, to transparently accelerate Deep Netts, a Java-based deep learning framework.
arXiv Detail & Related papers (2020-01-13T13:03:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.