Guided Speech Enhancement Network
- URL: http://arxiv.org/abs/2303.07486v1
- Date: Mon, 13 Mar 2023 21:48:20 GMT
- Title: Guided Speech Enhancement Network
- Authors: Yang Yang, Shao-Fu Shih, Hakan Erdogan, Jamie Menjay Lin, Chehung Lee,
Yunpeng Li, George Sung, Matthias Grundmann
- Abstract summary: The multi-microphone speech enhancement problem is often decomposed into two decoupled steps: a beamformer that provides spatial filtering and a single-channel speech enhancement model.
We propose a speech enhancement solution that takes both the raw microphone and beamformer outputs as the input for an ML model.
We name the ML module in our solution GSENet, short for Guided Speech Enhancement Network.
- Score: 17.27704800294671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High quality speech capture has been widely studied for both voice
communication and human-computer interface purposes. To improve capture
performance, multi-microphone speech enhancement techniques are often deployed
on various devices. The multi-microphone speech enhancement problem is commonly
decomposed into two decoupled steps: a beamformer that provides spatial
filtering and a single-channel speech enhancement model that cleans up the
beamformer output. In this work, we propose a speech enhancement solution that
takes both the raw microphone and beamformer outputs as the input for an ML
model. We devise a simple yet effective training scheme that allows the model
to learn from the cues of the beamformer by contrasting the two inputs, which
greatly boosts its capability in spatial rejection while it conducts the
general tasks of denoising and dereverberation. The proposed solution takes
advantage of classical spatial filtering algorithms instead of competing with
them. By design, the beamformer module can be selected separately and does not
require a large amount of data to be optimized for a given form factor, and the
network model can be considered a standalone module that is highly transferable
independently of the microphone array. We name the ML module in our solution
GSENet, short for Guided Speech Enhancement Network. We demonstrate its
effectiveness, in terms of the suppression of noise and interfering speech, on
real-world data collected on multi-microphone devices.
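A minimal sketch of the input arrangement described above, assuming a fixed delay-and-sum beamformer and stacked STFT magnitudes as the network input; these specifics are illustrative assumptions rather than the paper's implementation, and the enhancement network itself is omitted.

```python
import numpy as np

def stft(x, frame_len=512, hop=256):
    """Naive STFT: frame the signal, window, and FFT each frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1)          # (frames, bins)

def delay_and_sum(mics, delays_samples):
    """Toy fixed beamformer: integer-sample alignment (wrap-around
    via np.roll is ignored for this sketch), then channel average."""
    aligned = [np.roll(m, -d) for m, d in zip(mics, delays_samples)]
    return np.mean(aligned, axis=0)

# Two microphone channels (random stand-ins for real captures).
rng = np.random.default_rng(0)
mic0 = rng.standard_normal(16000)
mic1 = rng.standard_normal(16000)

beam = delay_and_sum([mic0, mic1], delays_samples=[0, 2])

# GSENet-style input: the raw mic and the beamformer output are
# presented side by side so the model can contrast them; spatially
# rejected energy shows up as a difference between the two channels.
features = np.stack([np.abs(stft(mic0)), np.abs(stft(beam))], axis=0)
print(features.shape)  # (2, frames, bins) -> fed to the enhancement net
```

Because the guidance signal is just another input channel, the beamformer can be tuned per form factor while the network that consumes the stacked features stays unchanged, matching the decoupling argued for in the abstract.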
Related papers
- FINALLY: fast and universal speech enhancement with studio-like quality [7.207284147264852]
We address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion.
We study various feature extractors for perceptual loss to facilitate the stability of adversarial training.
We integrate a WavLM-based perceptual loss into the MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model.
arXiv Detail & Related papers (2024-10-08T11:16:03Z)
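The perceptual loss in the entry above is, in its general form, a distance measured in the feature space of a pretrained extractor. A hedged sketch follows, with a toy log-spectrum extractor standing in for WavLM; the actual FINALLY pipeline and its MS-STFT discriminators are not reproduced here.

```python
import numpy as np

def toy_feature_extractor(wave):
    """Stand-in for a pretrained SSL model such as WavLM: here just
    framed log-magnitude spectra, so the example stays self-contained."""
    frames = wave[: len(wave) // 256 * 256].reshape(-1, 256)
    return np.log1p(np.abs(np.fft.rfft(frames, axis=-1)))

def perceptual_loss(enhanced, clean, extractor=toy_feature_extractor):
    """L1 distance between extractor features of enhanced and clean speech."""
    return np.mean(np.abs(extractor(enhanced) - extractor(clean)))

rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
enhanced = clean + 0.1 * rng.standard_normal(16000)
print(perceptual_loss(enhanced, clean))
```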
- DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z)
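Discrete speech units of the kind the entry above consumes are commonly produced by clustering continuous SSL encoder features, for example with k-means; the entry does not specify this paper's exact tokenizer, so the sketch below is an assumption in that common style.

```python
import numpy as np

def kmeans(features, k=8, iters=20, seed=0):
    """Minimal k-means; real DSU pipelines fit this on large corpora."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = features[assign == c].mean(axis=0)
    return centroids

rng = np.random.default_rng(2)
encoder_out = rng.standard_normal((200, 16))   # stand-in SSL features
codebook = kmeans(encoder_out)

# Discrete speech units: each frame becomes the ID of its nearest centroid.
units = np.linalg.norm(encoder_out[:, None] - codebook[None], axis=-1).argmin(1)
print(units[:20])   # token-like sequence an LLM can consume
```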
- Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting [14.402357651227003]
We investigate the use of a speech SSL model for speech inpainting, that is, reconstructing a missing portion of a speech signal from its surrounding context.
For that purpose, we combine an SSL encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role of a decoder.
arXiv Detail & Related papers (2024-05-30T14:41:39Z)
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models to handle various tasks at the same time in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
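The conditional-diffusion formulation behind the uSee entry above can be pictured as standard DDPM ancestral sampling with the degraded recording as conditioning; the sketch below uses a stand-in noise predictor and a generic noise schedule, not the paper's actual model or conditioning scheme.

```python
import numpy as np

rng = np.random.default_rng(6)

def eps_model(x, t, cond):
    """Stand-in noise predictor; a real uSee-style model would be a neural
    network conditioned on the degraded spectrogram and the desired task."""
    return 0.5 * (x - cond)

T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

cond = rng.standard_normal(257)       # conditioning: degraded spectrum frame
x = rng.standard_normal(257)          # start from pure noise

for t in reversed(range(T)):          # standard DDPM ancestral sampling
    eps = eps_model(x, t, cond)
    x = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:                         # add noise except at the final step
        x += np.sqrt(betas[t]) * rng.standard_normal(257)
print(x.shape)                        # one generated spectrum frame
```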
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that devotes its main training parameters to multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
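Audio-guided cross-modal fusion of the kind the entry above builds on can be reduced to cross-attention in which audio frames form the queries and visual (lip) frames the keys and values. A single-head numpy sketch, not the paper's exact CMFE layer:

```python
import numpy as np

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values

rng = np.random.default_rng(3)
audio_feats = rng.standard_normal((50, 64))   # audio frames as queries
visual_feats = rng.standard_normal((25, 64))  # lip frames as keys/values

# Audio-guided fusion: each audio frame attends over the visual stream.
fused = audio_feats + cross_attention(audio_feats, visual_feats, visual_feats)
print(fused.shape)  # (50, 64)
```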
- Efficient Monaural Speech Enhancement using Spectrum Attention Fusion [15.8309037583936]
We present an improvement for speech enhancement models that maintains the expressiveness of self-attention while significantly reducing model complexity.
We construct a convolutional module to replace several self-attention layers in a speech Transformer, allowing the model to more efficiently fuse spectral features.
Our proposed model achieves comparable or better results than SOTA models with significantly fewer parameters (0.58M) on the Voice Bank + DEMAND dataset.
arXiv Detail & Related papers (2023-08-04T11:39:29Z)
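Replacing self-attention with a convolutional fusion stage, as the entry above describes, trades global quadratic-cost mixing for local linear-cost mixing along the spectral axis. A toy illustration of that trade, with an assumed smoothing kernel rather than the paper's actual module:

```python
import numpy as np

def conv1d(x, kernel):
    """Per-frame 1-D convolution along the frequency axis ('same' padding)."""
    pad = len(kernel) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)), mode="edge")
    return np.stack([np.convolve(row, kernel, mode="valid") for row in xp])

rng = np.random.default_rng(4)
spec = rng.standard_normal((61, 257))          # (frames, freq bins)

# A small convolutional fusion stage: local spectral mixing at O(n) cost,
# standing in for the quadratic-cost self-attention layers it replaces.
kernel = np.array([0.25, 0.5, 0.25])
fused = conv1d(spec, kernel)
print(fused.shape)  # (61, 257)
```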
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
We propose Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation [23.758202121043805]
We propose a novel network to unify speech enhancement and separation with gradient modulation to improve noise-robustness.
Experimental results show that our approach achieves the state-of-the-art on large-scale Libri2Mix- and Libri3Mix-noisy datasets.
arXiv Detail & Related papers (2023-02-22T03:54:50Z)
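The entry above does not spell out its gradient modulation rule; one common form of gradient modulation between two task losses resolves conflicting gradient directions by projection, sketched here as an assumption rather than that paper's method.

```python
import numpy as np

def modulate(g_enh, g_sep):
    """If the enhancement and separation gradients conflict (negative dot
    product), project the first onto the normal plane of the second before
    combining; a generic conflict-resolution rule."""
    dot = g_enh @ g_sep
    if dot < 0:
        g_enh = g_enh - dot / (g_sep @ g_sep) * g_sep
    return g_enh + g_sep

g_enhance = np.array([1.0, -0.5])   # toy per-parameter gradients
g_separate = np.array([-0.8, 1.0])
print(modulate(g_enhance, g_separate))
```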
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
- VoiceFilter-Lite: Streaming Targeted Voice Separation for On-Device Speech Recognition [60.462770498366524]
We introduce VoiceFilter-Lite, a single-channel source separation model that runs on the device to preserve only the speech signals from a target user.
We show that such a model can be quantized as an 8-bit integer model and run in real time.
arXiv Detail & Related papers (2020-09-09T14:26:56Z)
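Post-training 8-bit quantization of the kind the entry above relies on maps float weights to int8 with a per-tensor scale. A generic symmetric-quantization sketch; the paper's exact scheme may differ.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization of a weight tensor to int8."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(5)
weights = rng.standard_normal((128, 128)).astype(np.float32)

q, scale = quantize_int8(weights)
err = np.max(np.abs(dequantize(q, scale) - weights))
print(q.dtype, f"max abs error {err:.4f}")  # int8, small reconstruction error
```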
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.