End-to-End Neural Audio Coding for Real-Time Communications
- URL: http://arxiv.org/abs/2201.09429v2
- Date: Tue, 25 Jan 2022 02:14:49 GMT
- Title: End-to-End Neural Audio Coding for Real-Time Communications
- Authors: Xue Jiang, Xiulian Peng, Chengyu Zheng, Huaying Xue, Yuan Zhang, Yan Lu
- Abstract summary: This paper proposes the TFNet, an end-to-end neural audio codec with low latency for real-time communications (RTC).
An interleaved structure is proposed for temporal filtering to capture both short-term and long-term temporal dependencies.
With end-to-end optimization, the TFNet is jointly optimized with speech enhancement and packet loss concealment, yielding a one-for-all network for three tasks.
- Score: 22.699018098484707
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep-learning based methods have shown their advantages in audio coding over traditional ones, but limited attention has been paid to real-time communications (RTC). This paper proposes the TFNet, an end-to-end neural audio codec with low latency for RTC. It takes an encoder-temporal filtering-decoder paradigm that has seldom been investigated in audio coding. An interleaved structure is proposed for temporal filtering to capture both short-term and long-term temporal dependencies. Furthermore, with end-to-end optimization, the TFNet is jointly optimized with speech enhancement and packet loss concealment, yielding a one-for-all network for three tasks. Both subjective and objective results demonstrate the efficiency of the proposed TFNet.
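As a rough illustration of the encoder-temporal filtering-decoder paradigm, the PyTorch sketch below interleaves a short-term (causal convolution) block with a long-term (recurrent) block between a strided encoder and a transposed-convolution decoder. All layer choices and sizes are illustrative assumptions, not the paper's actual TFNet configuration.

```python
# Minimal sketch of an encoder - temporal filtering - decoder codec.
# Channel counts and the interleaving pattern are illustrative assumptions;
# the paper's exact TFNet layers are not reproduced here.
import torch
import torch.nn as nn

class InterleavedTemporalFilter(nn.Module):
    """Alternates short-term (causal depthwise conv) and long-term (GRU) blocks."""
    def __init__(self, channels: int, num_pairs: int = 2):
        super().__init__()
        blocks = []
        for _ in range(num_pairs):
            # Short-term: depthwise conv over time; trimming keeps it causal.
            blocks.append(nn.Conv1d(channels, channels, kernel_size=3,
                                    padding=2, groups=channels))
            # Long-term: a recurrent layer carries context across many frames.
            blocks.append(nn.GRU(channels, channels, batch_first=True))
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):  # x: (batch, channels, frames)
        for block in self.blocks:
            if isinstance(block, nn.Conv1d):
                x = block(x)[..., : x.shape[-1]]  # trim to keep causality
            else:
                y, _ = block(x.transpose(1, 2))   # GRU expects (B, T, C)
                x = y.transpose(1, 2)
        return x

class TinyCodec(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Conv1d(1, channels, kernel_size=8, stride=4, padding=2)
        self.temporal = InterleavedTemporalFilter(channels)
        self.decoder = nn.ConvTranspose1d(channels, 1, kernel_size=8, stride=4, padding=2)

    def forward(self, wav):                # wav: (batch, 1, samples)
        z = self.encoder(wav)              # downsampled latent frames
        z = self.temporal(z)               # short- and long-term dependencies
        return self.decoder(z)             # reconstructed waveform

wav = torch.randn(1, 1, 16000)             # 1 s of 16 kHz audio
out = TinyCodec()(wav)
print(out.shape)                           # torch.Size([1, 1, 16000])
```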
Related papers
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms to the complex time-frequency bins produced by the short-time Fourier transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
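A minimal sketch of the time-frequency processing pattern RTFS-Net builds on: take the complex STFT, treat the real and imaginary parts as channels, filter them with a network, and invert back to a waveform. The small convolutional stand-in below is hypothetical; it only marks where RTFS-Net's recurrent TF blocks would sit.

```python
# Operating on complex STFT bins: real/imag parts become channels, are
# filtered, and the result is inverted back to a waveform.
import torch
import torch.nn as nn

n_fft, hop = 512, 128
window = torch.hann_window(n_fft)
wav = torch.randn(1, 16000)

spec = torch.stft(wav, n_fft, hop_length=hop, window=window,
                  return_complex=True)             # (B, F, T), complex
x = torch.view_as_real(spec).permute(0, 3, 1, 2)   # (B, 2, F, T): re/im channels

filter_net = nn.Sequential(                        # hypothetical stand-in network
    nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 2, 3, padding=1),
)
y = filter_net(x)                                  # predicted re/im of the target
est = torch.view_as_complex(y.permute(0, 2, 3, 1).contiguous())
out = torch.istft(est, n_fft, hop_length=hop, window=window,
                  length=wav.shape[-1])
print(out.shape)                                   # torch.Size([1, 16000])
```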
- HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform [21.896817015593122]
We introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain.
Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN.
Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications.
arXiv Detail & Related papers (2023-09-18T05:30:15Z)
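To make the harmonic-plus-noise source idea concrete, the sketch below builds such an excitation signal from a fixed pitch: a stack of harmonics below Nyquist plus Gaussian noise. In a vocoder like HiFTNet this source would be shaped by learned filters; all constants here are illustrative.

```python
# A harmonic-plus-noise source signal, the kind of excitation a neural
# source-filter vocoder shapes into speech. Constants are illustrative.
import torch

sr, f0, dur = 16000, 120.0, 0.5            # sample rate, pitch (Hz), seconds
t = torch.arange(int(sr * dur)) / sr
n_harmonics = int((sr / 2) // f0)          # keep harmonics below Nyquist

harm = sum(torch.sin(2 * torch.pi * f0 * (k + 1) * t) for k in range(n_harmonics))
harm = harm / n_harmonics                  # normalize the harmonic stack
noise = 0.03 * torch.randn_like(t)         # aspiration / unvoiced component

source = harm + noise                      # excitation fed to the learned filter
print(source.shape)                        # torch.Size([8000])
```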
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
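One plausible reading of co-predicting two audio-visual representations is sketched below: each branch regresses the other's embedding with a stop-gradient on the target side (a SimSiam-style choice). The linear encoders and the exact loss are assumptions, not AVPC's published design.

```python
# Hedged sketch of "co-predicting" two representations of the same source:
# each branch predicts the other's embedding; targets are detached.
import torch
import torch.nn as nn
import torch.nn.functional as F

audio_enc = nn.Linear(128, 64)   # stand-ins for real audio/visual encoders
video_enc = nn.Linear(256, 64)
pred_a2v = nn.Linear(64, 64)
pred_v2a = nn.Linear(64, 64)

a = audio_enc(torch.randn(8, 128))              # audio embedding of a source
v = video_enc(torch.randn(8, 256))              # visual embedding, same source

loss = (F.mse_loss(pred_a2v(a), v.detach())     # audio predicts visual target
        + F.mse_loss(pred_v2a(v), a.detach()))  # and vice versa
loss.backward()
print(float(loss))
```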
- FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs [1.8047694351309207]
FastFit is a novel neural vocoder architecture that replaces the U-Net encoder with multiple short-time Fourier transforms (STFTs).
We show that FastFit achieves nearly twice the generation speed of baseline iteration-based vocoders while maintaining high sound quality.
arXiv Detail & Related papers (2023-05-18T09:05:17Z)
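The core substitution FastFit describes, multi-resolution STFT features in place of a learned U-Net encoder, can be sketched as below: one STFT per decoder level, with magnitudes used as skip inputs. The (n_fft, hop) pairs are illustrative assumptions.

```python
# Multi-resolution STFT features standing in for learned encoder skips:
# each scale's magnitude spectrogram matches one decoder resolution.
import torch

wav = torch.randn(1, 16000)
scales = [(1024, 256), (512, 128), (256, 64)]   # (n_fft, hop) per decoder level

skips = []
for n_fft, hop in scales:
    spec = torch.stft(wav, n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    skips.append(spec.abs())                    # magnitude features, (B, F, T)

for s in skips:
    print(s.shape)   # time resolution doubles as the hop halves
```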
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder with hash encoding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
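The sketch below shows the general pattern of a neural attenuation field: an MLP maps an encoded 3D coordinate to a scalar attenuation coefficient. For brevity it substitutes simple Fourier features for the paper's learned hash encoding, so it approximates the approach rather than reproducing NAF.

```python
# Attenuation as a continuous field: MLP(encode(x, y, z)) -> coefficient.
# Fourier features replace the paper's hash encoding in this sketch.
import torch
import torch.nn as nn

def fourier_features(xyz, n_freqs=6):           # xyz: (N, 3) in [0, 1]
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi
    angles = xyz[..., None] * freqs             # (N, 3, n_freqs)
    return torch.cat([angles.sin(), angles.cos()], -1).flatten(1)  # (N, 36)

field = nn.Sequential(nn.Linear(36, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))          # -> attenuation coefficient

pts = torch.rand(1024, 3)                        # sample points along X-ray paths
mu = field(fourier_features(pts))                # predicted attenuation per point
print(mu.shape)                                  # torch.Size([1024, 1])
```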
- Latent-Domain Predictive Neural Speech Coding [22.65761249591267]
This paper introduces latent-domain predictive coding into the VQ-VAE framework.
We propose the TF-Codec for low-latency neural speech coding in an end-to-end manner.
Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps.
arXiv Detail & Related papers (2022-07-18T03:18:08Z)
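A hedged sketch of latent-domain predictive coding: a recurrent predictor estimates the current latent from previously decoded latents, and only the prediction residual is vector-quantized and transmitted. Codebook size, latent dimension, and the GRU predictor are assumptions, not TF-Codec's actual components.

```python
# Predictive coding in the latent domain: code only the residual between
# the encoder latent and its prediction from past decoded latents.
import torch
import torch.nn as nn

dim, codebook_size = 32, 256
codebook = nn.Embedding(codebook_size, dim)
predictor = nn.GRU(dim, dim, batch_first=True)

def quantize(residual):                         # nearest-neighbor VQ
    d = torch.cdist(residual, codebook.weight)  # (B, K) distances
    idx = d.argmin(-1)
    return codebook(idx), idx

latents = torch.randn(1, 10, dim)               # encoder output, 10 frames
h = None
prev = torch.zeros(1, 1, dim)                   # decoded latent of last frame
for t in range(latents.shape[1]):
    pred, h = predictor(prev, h)                # predict the current latent
    res_q, idx = quantize(latents[:, t] - pred[:, 0])  # code the residual only
    prev = (pred[:, 0] + res_q).unsqueeze(1)    # reconstruction for next step
    print(int(idx))                             # index that would be transmitted
```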
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) approach, a discrete-action Soft Actor-Critic (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
With a latency- and accuracy-aware reward design, such a framework adapts well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC (ultra-reliable low-latency communication).
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
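As a rough illustration of the decision space, the sketch below samples an early-exit point and a compression bit width from a softmax policy over discrete actions. The state features, action sets, and network are hypothetical placeholders; the paper's SAC-d training loop is not reproduced.

```python
# A stochastic policy over co-inference decisions: pick a split (exit) point
# and a number of bits for the feature sent to the edge server.
import torch
import torch.nn as nn

exit_points = [1, 2, 3, 4]        # candidate split layers on the device
bit_choices = [2, 4, 8, 16]       # bits used to compress the transmitted feature

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(),
                       nn.Linear(64, len(exit_points) * len(bit_choices)))

state = torch.randn(1, 8)          # e.g. channel quality, load, target accuracy
probs = policy(state).flatten().softmax(-1)     # soft (stochastic) policy
a = torch.multinomial(probs, 1).item()
print(exit_points[a // len(bit_choices)], bit_choices[a % len(bit_choices)])
```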
- Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization [96.73647162960842]
Temporal action localization (TAL) is a fundamental yet challenging task in video understanding.
Existing TAL methods rely on pre-training a video encoder through action classification supervision.
We introduce a novel low-fidelity end-to-end (LoFi) video encoder pre-training method.
arXiv Detail & Related papers (2021-03-28T22:18:14Z)
- Scaling Up Online Speech Recognition Using ConvNets [33.75588539732141]
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy.
The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate.
arXiv Detail & Related papers (2020-01-27T12:55:02Z)
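Limiting future context for lower latency can be sketched with an asymmetrically padded convolution, as below: each output frame sees the full kernel's worth of past frames but only a bounded lookahead. Kernel size and lookahead are illustrative, not the paper's tuned TDS settings.

```python
# A streaming-friendly convolution: pad mostly on the left so each output
# frame depends on at most `lookahead` future frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LimitedLookaheadConv(nn.Module):
    def __init__(self, channels, kernel_size=5, lookahead=1):
        super().__init__()
        self.left = kernel_size - 1 - lookahead   # past frames used
        self.right = lookahead                    # future frames allowed
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                         # x: (B, C, T)
        x = F.pad(x, (self.left, self.right))     # asymmetric time padding
        return self.conv(x)

x = torch.randn(1, 80, 200)                       # e.g. 80-dim filterbank frames
y = LimitedLookaheadConv(80)(x)
print(y.shape)                                    # (1, 80, 200): same length
```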
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
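A minimal sketch of a direction-informed multi-channel filter: the microphone signals and a broadcast encoding of the target direction enter a network that outputs a single-channel waveform estimate. The convolutional body and the scalar direction feature are stand-ins for the paper's temporal-spatial design.

```python
# Direction-informed target speech extraction: multi-channel waveforms plus
# a target-direction feature in, single-channel waveform estimate out.
import torch
import torch.nn as nn

n_mics, samples = 4, 16000
mix = torch.randn(1, n_mics, samples)             # microphone array signals
doa = torch.tensor([[0.7]])                       # target direction (radians)

dir_feat = doa[..., None].expand(-1, 1, samples)  # broadcast direction over time
net = nn.Sequential(                              # stand-in for the real filter
    nn.Conv1d(n_mics + 1, 32, 16, padding=8), nn.ReLU(),
    nn.Conv1d(32, 1, 16, padding=8),
)
est = net(torch.cat([mix, dir_feat], dim=1))[..., :samples]
print(est.shape)                                  # (1, 1, 16000)
```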