Speech Enhancement for Virtual Meetings on Cellular Networks
- URL: http://arxiv.org/abs/2302.00868v1
- Date: Thu, 2 Feb 2023 04:35:48 GMT
- Title: Speech Enhancement for Virtual Meetings on Cellular Networks
- Authors: Hojeong Lee, Minseon Gwak, Kawon Lee, Minjeong Kim, Joseph Konan and
Ojas Bhargave
- Abstract summary: We study speech enhancement using deep learning (DL) for virtual meetings on cellular devices.
We collect a transmitted DNS (t-DNS) dataset using Zoom Meetings over the T-Mobile network.
The goal of this project is to enhance the speech transmitted over the cellular networks using deep learning models.
- Score: 1.487576938041254
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study speech enhancement using deep learning (DL) for virtual meetings on
cellular devices, where transmitted speech suffers from background noise and
transmission loss that degrade speech quality. Since the Deep Noise Suppression
(DNS) Challenge dataset does not contain such practical disturbances, we collect a
transmitted DNS (t-DNS) dataset using Zoom Meetings over the T-Mobile network. We
select two baseline models: Demucs and FullSubNet. Demucs is an end-to-end
model that takes time-domain inputs and outputs time-domain denoised speech,
while FullSubNet takes time-frequency-domain inputs and outputs the energy
ratio of the target speech in those inputs. The goal of this project is to
enhance speech transmitted over cellular networks using deep learning
models.
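As a rough illustration of the two baseline interfaces described in the abstract, the sketch below (PyTorch, with placeholder layers and hypothetical shapes, not the actual Demucs or FullSubNet code) contrasts a time-domain enhancer with a time-frequency mask estimator.
```python
# Illustrative sketch only: stand-in networks showing the two I/O conventions,
# not the real Demucs or FullSubNet implementations.
import torch
import torch.nn as nn

class TimeDomainEnhancer(nn.Module):
    """Demucs-style: noisy waveform in, denoised waveform out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)  # placeholder for the real encoder/decoder
    def forward(self, wav):                     # wav: (batch, 1, samples)
        return self.net(wav)                    # -> (batch, 1, samples)

class MaskEstimator(nn.Module):
    """FullSubNet-style: magnitude spectrogram in, per-bin ratio mask out."""
    def __init__(self, n_bins=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_bins, n_bins), nn.Sigmoid())
    def forward(self, mag):                     # mag: (batch, frames, bins)
        return self.net(mag)                    # mask in [0, 1], same shape

wav = torch.randn(1, 1, 16000)                              # 1 s of 16 kHz noisy speech
denoised = TimeDomainEnhancer()(wav)                        # time-domain path

spec = torch.stft(wav.squeeze(1), n_fft=512, hop_length=256,
                  return_complex=True)                      # (1, 257, frames)
mask = MaskEstimator()(spec.abs().transpose(1, 2))          # (1, frames, 257)
enhanced_spec = mask.transpose(1, 2) * spec                 # apply mask per T-F bin
enhanced = torch.istft(enhanced_spec, n_fft=512, hop_length=256)
```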
Related papers
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
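As a loose illustration of sequence-length reduction for speech tokens, the sketch below mean-pools fixed-size groups of frame embeddings; this is an assumed, generic grouping scheme, not the GroupFormer architecture described in the paper.
```python
# Generic frame-grouping sketch (assumption), not GroupFormer itself.
import torch

def group_frames(frames: torch.Tensor, group_size: int = 8) -> torch.Tensor:
    """frames: (batch, T, dim) -> (batch, ceil(T / group_size), dim) by mean-pooling."""
    b, t, d = frames.shape
    pad = (-t) % group_size                              # pad so T divides evenly
    if pad:
        frames = torch.cat([frames, frames.new_zeros(b, pad, d)], dim=1)
    return frames.view(b, -1, group_size, d).mean(dim=2)

speech = torch.randn(2, 500, 768)                        # ~10 s of 50 Hz speech features
print(group_frames(speech).shape)                        # torch.Size([2, 63, 768])
```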
arXiv Detail & Related papers (2024-10-09T05:04:31Z) - FINALLY: fast and universal speech enhancement with studio-like quality [7.207284147264852]
We address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion.
We study various feature extractors for perceptual loss to facilitate the stability of adversarial training.
We integrate a WavLM-based perceptual loss into the MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model.
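A minimal sketch of what a WavLM-based perceptual (feature-matching) loss can look like, assuming torchaudio's WAVLM_BASE bundle as the feature extractor; the paper's exact loss and MS-STFT adversarial pipeline may differ.
```python
# Assumed setup: torchaudio's WAVLM_BASE as a frozen feature extractor.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAVLM_BASE
wavlm = bundle.get_model().eval()
for p in wavlm.parameters():
    p.requires_grad_(False)                              # frozen perceptual front-end

def perceptual_loss(enhanced: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
    """L1 distance between WavLM hidden features of enhanced and clean speech.
    Both inputs: (batch, samples) at 16 kHz."""
    feats_enh, _ = wavlm.extract_features(enhanced)
    feats_cln, _ = wavlm.extract_features(clean)
    return sum(torch.nn.functional.l1_loss(a, b)
               for a, b in zip(feats_enh, feats_cln)) / len(feats_enh)
```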
arXiv Detail & Related papers (2024-10-08T11:16:03Z) - VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
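For illustration, a symmetric InfoNCE-style contrastive loss is one common way to pull paired text and speech representations into a joint space; the sketch below is an assumption, not necessarily VQ-CTAP's exact objective or quantization scheme.
```python
# Generic cross-modal contrastive loss sketch (assumption), not VQ-CTAP's own loss.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, speech_emb, temperature=0.07):
    """text_emb, speech_emb: (N, dim) paired representations (row i matches row i)."""
    text_emb = F.normalize(text_emb, dim=-1)
    speech_emb = F.normalize(speech_emb, dim=-1)
    logits = text_emb @ speech_emb.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(text_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```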
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
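A minimal sketch of the client-side step described above: chunk 16 kHz mono PCM into frames and run voice activity detection before forwarding speech to the recognition engine. The webrtcvad package is an assumed backend; the paper's actual stack may differ.
```python
# Assumed VAD backend: webrtcvad (pip install webrtcvad).
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2         # 16-bit mono samples per frame

def speech_frames(pcm: bytes, aggressiveness: int = 2):
    """Yield 30 ms frames of raw 16-bit PCM that the VAD flags as speech."""
    vad = webrtcvad.Vad(aggressiveness)
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame          # e.g. buffer these and send them to the ASR engine
```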
arXiv Detail & Related papers (2024-06-26T07:39:20Z) - DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
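As background, discrete speech units are commonly obtained by clustering continuous encoder outputs; the sketch below uses k-means as an assumed recipe and is not the paper's actual unit extraction or LLM integration.
```python
# Assumed recipe: k-means over encoder features to produce discrete speech units (DSUs).
import numpy as np
from sklearn.cluster import KMeans

def fit_unit_codebook(features: np.ndarray, n_units: int = 500) -> KMeans:
    """features: (frames, dim) continuous encoder outputs pooled over a corpus."""
    return KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)

def to_discrete_units(codebook: KMeans, features: np.ndarray) -> np.ndarray:
    """Map each frame to its nearest cluster id -> a token sequence an LLM can ingest."""
    return codebook.predict(features)                    # (frames,) integer unit ids

corpus_feats = np.random.randn(10000, 768).astype(np.float32)   # placeholder features
codebook = fit_unit_codebook(corpus_feats, n_units=100)
units = to_discrete_units(codebook, corpus_feats[:200])
```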
arXiv Detail & Related papers (2024-06-13T17:28:13Z) - Adaptation of Tongue Ultrasound-Based Silent Speech Interfaces Using
Spatial Transformer Networks [0.24466725954625895]
Silent speech interfaces (SSI) are able to synthesize intelligible speech from articulatory movement data under certain conditions.
The resulting models are speaker-specific, making a quick switch between users troublesome.
We extend our deep networks with a spatial transformer network (STN) module, capable of performing an affine transformation on the input images.
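A minimal sketch of a generic spatial transformer front-end that predicts and applies an affine transform to input images; the localization network here is a placeholder, not the paper's ultrasound-specific architecture or adaptation procedure.
```python
# Generic STN sketch: localization net -> affine params -> grid sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    def __init__(self, in_ch=1):
        super().__init__()
        self.loc = nn.Sequential(                        # localization net -> 6 affine params
            nn.Conv2d(in_ch, 8, 7), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 6),
        )
        # initialize to the identity transform
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):                                # x: (batch, C, H, W)
        theta = self.loc(x).view(-1, 2, 3)               # per-image affine matrix
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

frames = torch.randn(4, 1, 64, 128)                      # e.g. tongue ultrasound frames
aligned = SpatialTransformer()(frames)                   # same shape, spatially normalized
```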
arXiv Detail & Related papers (2023-05-30T15:41:47Z) - Guided Speech Enhancement Network [17.27704800294671]
The multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering and a single-channel speech enhancement model.
We propose a speech enhancement solution that takes both the raw microphone and beamformer outputs as the input for an ML model.
We call the ML module in our solution GSENet, short for Guided Speech Enhancement Network.
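A minimal sketch of the input convention described above: feed both the raw microphone signal and the beamformer output to a single enhancement model as a two-channel input. The network body is a placeholder, not GSENet itself.
```python
# Placeholder network illustrating the two-channel (raw mic + beamformer) input.
import torch
import torch.nn as nn

class GuidedEnhancer(nn.Module):
    def __init__(self):
        super().__init__()
        # 2 input channels: [raw mic, beamformer output]; 1 output channel: enhanced speech
        self.net = nn.Sequential(
            nn.Conv1d(2, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=9, padding=4),
        )
    def forward(self, raw_mic, beamformed):              # each: (batch, samples)
        x = torch.stack([raw_mic, beamformed], dim=1)    # (batch, 2, samples)
        return self.net(x).squeeze(1)                    # (batch, samples)

raw = torch.randn(1, 16000)
bf = torch.randn(1, 16000)
enhanced = GuidedEnhancer()(raw, bf)
```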
arXiv Detail & Related papers (2023-03-13T21:48:20Z) - Cellular Network Speech Enhancement: Removing Background and
Transmission Noise [0.0]
This paper demonstrates how to surpass industrial performance, achieving 1.92 PESQ and 0.88 STOI as well as superior acoustic fidelity, perceptual quality, and intelligibility across various metrics.
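For reference, the two quoted metrics can be computed with the commonly used pesq and pystoi Python packages (assumed tooling; the paper may rely on other implementations):
```python
# Assumed tooling: pip install pesq pystoi. Signals must be time-aligned, 16 kHz.
import numpy as np
from pesq import pesq
from pystoi import stoi

def score(clean: np.ndarray, enhanced: np.ndarray, fs: int = 16000):
    """Return (wideband PESQ, STOI) for a pair of 1-D float waveforms."""
    return pesq(fs, clean, enhanced, "wb"), stoi(clean, enhanced, fs, extended=False)
```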
arXiv Detail & Related papers (2023-01-22T00:18:10Z) - A Study of Designing Compact Audio-Visual Wake Word Spotting System
Based on Iterative Fine-Tuning in Neural Network Pruning [57.28467469709369]
We investigate designing a compact audio-visual wake word spotting (WWS) system by utilizing visual information.
We introduce a neural network pruning strategy via the lottery ticket hypothesis in an iterative fine-tuning manner (LTH-IF).
The proposed audio-visual system achieves significant performance improvements over the single-modality (audio-only or video-only) system under different noisy conditions.
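A minimal sketch of lottery-ticket-style iterative magnitude pruning with fine-tuning and weight rewinding; this is a generic recipe, not the paper's exact LTH-IF procedure or audio-visual model.
```python
# Generic lottery-ticket iterative pruning sketch (assumption), not the paper's LTH-IF code.
import copy
import torch
import torch.nn.utils.prune as prune

def lth_iterative_prune(model, finetune_fn, rounds=5, amount=0.2):
    """finetune_fn(model) trains the (masked) model in place for a few epochs."""
    init_state = copy.deepcopy(model.state_dict())           # weights at initialization
    targets = [(name, m) for name, m in model.named_modules()
               if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))]
    for _ in range(rounds):
        finetune_fn(model)                                    # train with current mask
        prune.global_unstructured([(m, "weight") for _, m in targets],
                                  pruning_method=prune.L1Unstructured,
                                  amount=amount)              # drop a fraction of remaining weights
        for name, m in targets:                               # lottery-ticket rewind
            m.weight_orig.data.copy_(init_state[name + ".weight"])
    return model

# Toy usage with a no-op "training" step, purely for illustration.
model = torch.nn.Sequential(torch.nn.Linear(40, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
lth_iterative_prune(model, finetune_fn=lambda m: None, rounds=3, amount=0.2)
```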
arXiv Detail & Related papers (2022-02-17T08:26:25Z) - Decoupling Pronunciation and Language for End-to-end Code-switching
Automatic Speech Recognition [66.47000813920617]
We propose a decoupled transformer model to use monolingual paired data and unpaired text data.
The model is decoupled into two parts: audio-to-phoneme (A2P) network and phoneme-to-text (P2T) network.
By using monolingual data and unpaired text data, the decoupled transformer model reduces the heavy dependence of E2E models on code-switching paired training data.
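A minimal sketch of the decoupled structure described above: an audio-to-phoneme (A2P) network feeding a phoneme-to-text (P2T) network, so each part can be trained on different data. The layers are placeholders, not the paper's transformer architecture.
```python
# Placeholder A2P / P2T modules illustrating the decoupled composition.
import torch
import torch.nn as nn

class A2P(nn.Module):                                    # audio features -> phoneme posteriors
    def __init__(self, feat_dim=80, n_phones=100):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_phones))
    def forward(self, feats):                            # (batch, frames, feat_dim)
        return self.net(feats).softmax(dim=-1)           # (batch, frames, n_phones)

class P2T(nn.Module):                                    # phoneme posteriors -> token logits
    def __init__(self, n_phones=100, vocab=5000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_phones, 256), nn.ReLU(),
                                 nn.Linear(256, vocab))
    def forward(self, phone_post):                       # can be trained on text-derived phonemes
        return self.net(phone_post)                      # (batch, frames, vocab)

feats = torch.randn(2, 300, 80)                          # e.g. 3 s of filterbank features
tokens = P2T()(A2P()(feats))                             # end-to-end composition at inference
```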
arXiv Detail & Related papers (2020-10-28T07:46:15Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
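For context, embedding-based speaker verification typically scores a trial by comparing embeddings with cosine similarity and thresholding; the sketch below shows that standard scoring step, not the paper's specific models.
```python
# Standard cosine-similarity scoring for speaker verification (generic, not paper-specific).
import numpy as np

def cosine_score(enroll_emb: np.ndarray, test_emb: np.ndarray) -> float:
    """Both embeddings: 1-D vectors produced by a speaker-embedding network."""
    enroll = enroll_emb / np.linalg.norm(enroll_emb)
    test = test_emb / np.linalg.norm(test_emb)
    return float(enroll @ test)

def verify(enroll_emb, test_emb, threshold=0.5) -> bool:
    return cosine_score(enroll_emb, test_emb) >= threshold   # accept / reject decision
```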
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.