Cellular Network Speech Enhancement: Removing Background and
Transmission Noise
- URL: http://arxiv.org/abs/2301.09027v1
- Date: Sun, 22 Jan 2023 00:18:10 GMT
- Title: Cellular Network Speech Enhancement: Removing Background and
Transmission Noise
- Authors: Amanda Shu, Hamza Khalid, Haohui Liu, Shikhar Agnihotri, Joseph Konan,
Ojas Bhargave
- Abstract summary: This paper demonstrates how to beat industrial performance and achieve 1.92 PESQ and 0.88 STOI, as well as superior acoustic fidelity, perceptual quality, and intelligibility in various metrics.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The primary objective of speech enhancement is to reduce background noise
while preserving the target's speech. A common dilemma occurs when a speaker is
confined to a noisy environment and receives a call with high background and
transmission noise. To address this problem, the Deep Noise Suppression (DNS)
Challenge focuses on removing the background noise with the next-generation
deep learning models to enhance the target's speech; however, researchers fail
to consider Voice over IP (VoIP) applications and their transmission noise.
Focusing on Google Meet and its cellular application, our work achieves
state-of-the-art performance on the Google Meet To Phone Track of the VoIP DNS
Challenge. This paper demonstrates how to beat industrial performance and
achieve 1.92 PESQ and 0.88 STOI, as well as superior acoustic fidelity,
perceptual quality, and intelligibility in various metrics.
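The PESQ and STOI figures above compare an enhanced signal against a clean reference. Computing them properly requires dedicated implementations (PESQ is specified in ITU-T P.862); as a simplified, hypothetical stand-in, the sketch below computes a plain signal-to-noise ratio, which follows the same reference-vs-degraded evaluation pattern. The function name and toy signals are illustrative, not from the paper.

```python
import numpy as np

def snr_db(reference: np.ndarray, enhanced: np.ndarray) -> float:
    """SNR in dB of `enhanced` measured against the clean `reference`.

    A crude proxy for the reference-based comparison that PESQ and
    STOI perform; it does not model perception or intelligibility.
    """
    noise = reference - enhanced  # residual error relative to the reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# Toy example: a 440 Hz sine "utterance" at 16 kHz with additive noise.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
noisy = clean + 0.1 * rng.standard_normal(16000)
print(f"{snr_db(clean, noisy):.1f} dB")
```

Higher values indicate less residual noise; unlike PESQ (roughly -0.5 to 4.5) or STOI (0 to 1), SNR is unbounded and only loosely correlated with perceptual quality.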
Related papers
- Speech Boosting: Low-Latency Live Speech Enhancement for TWS Earbuds [7.360661203298394]
This paper introduces a speech enhancement solution tailored for true wireless stereo (TWS) earbuds on-device usage.
The solution was specifically designed to support conversations in noisy environments, with active noise cancellation (ANC) activated.
arXiv Detail & Related papers (2024-09-27T12:47:36Z) - TRNet: Two-level Refinement Network leveraging Speech Enhancement for Noise Robust Speech Emotion Recognition [29.756961194844717]
The proposed TRNet substantially improves the robustness of speech emotion recognition in both matched and unmatched noisy environments.
arXiv Detail & Related papers (2024-04-19T16:09:17Z) - SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z) - Speech Enhancement for Virtual Meetings on Cellular Networks [1.487576938041254]
We study speech enhancement using deep learning (DL) for virtual meetings on cellular devices.
We collect a transmitted DNS (t-DNS) dataset using Zoom Meetings over the T-Mobile network.
The goal of this project is to enhance the speech transmitted over the cellular networks using deep learning models.
arXiv Detail & Related papers (2023-02-02T04:35:48Z) - Universal Speech Enhancement with Score-based Diffusion [21.294665965300922]
We present a universal speech enhancement system that tackles 55 different distortions at the same time.
Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network.
We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners.
arXiv Detail & Related papers (2022-06-07T07:32:32Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement
by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Improving Noise Robustness of Contrastive Speech Representation Learning
with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z) - Interactive Feature Fusion for End-to-End Noise-Robust Speech
Recognition [25.84784710031567]
We propose an interactive feature fusion network (IFF-Net) for noise-robust speech recognition.
Experimental results show that the proposed method achieves absolute word error rate (WER) reduction of 4.1% over the best baseline.
Our further analysis indicates that the proposed IFF-Net can complement some missing information in the over-suppressed enhanced feature.
arXiv Detail & Related papers (2021-10-11T13:40:07Z) - PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation
Extraction [90.55375210094995]
Speech enhancement aims to improve the perceptual quality of the speech signal by suppression of the background noise.
We propose an end-to-end deep learning framework, dubbed PL-EESR, for robust speaker representation extraction.
arXiv Detail & Related papers (2021-10-03T07:05:29Z) - Speech Enhancement for Wake-Up-Word detection in Voice Assistants [60.103753056973815]
Keywords spotting and in particular Wake-Up-Word (WUW) detection is a very important task for voice assistants.
This paper proposes a Speech Enhancement model adapted to the task of WUW detection.
It aims at increasing the recognition rate and reducing false alarms in the presence of background noise.
arXiv Detail & Related papers (2021-01-29T18:44:05Z) - CITISEN: A Deep Learning-Based Speech Signal-Processing Mobile
Application [63.2243126704342]
This study presents a deep learning-based speech signal-processing mobile application known as CITISEN.
CITISEN provides three functions: speech enhancement (SE), model adaptation (MA), and background noise conversion (BNC).
Compared with the noisy speech signals, the enhanced speech signals achieved improvements of about 6% and 33%.
arXiv Detail & Related papers (2020-08-21T02:04:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.