BERT for Joint Multichannel Speech Dereverberation with Spatial-aware
Tasks
- URL: http://arxiv.org/abs/2010.10892v2
- Date: Thu, 22 Oct 2020 02:41:39 GMT
- Title: BERT for Joint Multichannel Speech Dereverberation with Spatial-aware
Tasks
- Authors: Yang Jiao
- Abstract summary: We propose a method for joint multichannel speech dereverberation with two spatial-aware tasks.
The proposed method addresses the involved tasks as a sequence-to-sequence mapping problem.
- Score: 6.876734825043823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method for joint multichannel speech dereverberation with two
spatial-aware tasks: direction-of-arrival (DOA) estimation and speech
separation. The proposed method addresses the involved tasks as a
sequence-to-sequence mapping problem, which is general enough for a variety of
front-end speech enhancement tasks. The proposed method is inspired by the
excellent sequence modeling capability of Bidirectional Encoder Representations
from Transformers (BERT). Instead of utilizing explicit representations from
pretraining in a self-supervised manner, we utilize transformer-encoded hidden
representations in a supervised manner. Both multichannel spectral magnitude
and spectral phase information of varying-length utterances are encoded.
Experimental results demonstrate the effectiveness of the proposed method.
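The abstract frames the joint tasks as a supervised sequence-to-sequence mapping in which a BERT-style transformer encoder consumes multichannel spectral magnitude and phase features and shared hidden representations feed the individual tasks. The following is a minimal PyTorch sketch of that idea only; the layer sizes, head structure, DOA discretization, and utterance-level pooling are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiTaskSpeechEncoder(nn.Module):
    """Sketch of a transformer-encoder front end for joint dereverberation,
    speech separation, and DOA estimation (hypothetical dimensions)."""

    def __init__(self, n_channels=4, n_freq=257, n_speakers=2,
                 n_doa_classes=36, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # Each frame stacks magnitude and phase from all microphones.
        in_dim = 2 * n_channels * n_freq
        self.input_proj = nn.Linear(in_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Task heads share the transformer-encoded hidden representation.
        self.dereverb_head = nn.Linear(d_model, n_freq)                 # enhanced magnitude
        self.separation_head = nn.Linear(d_model, n_speakers * n_freq)  # per-speaker masks
        self.doa_head = nn.Linear(d_model, n_doa_classes)               # DOA class logits

    def forward(self, mag, phase, pad_mask=None):
        # mag, phase: (batch, frames, channels, freq); varying-length
        # utterances are handled with a boolean padding mask (batch, frames).
        B, T, C, F = mag.shape
        x = torch.cat([mag, phase], dim=-1).reshape(B, T, 2 * C * F)
        h = self.encoder(self.input_proj(x), src_key_padding_mask=pad_mask)
        dereverb = self.dereverb_head(h)                    # (B, T, F)
        masks = torch.sigmoid(self.separation_head(h))      # (B, T, S*F)
        masks = masks.view(B, T, -1, F)                     # (B, T, S, F)
        doa_logits = self.doa_head(h.mean(dim=1))           # utterance-level DOA
        return dereverb, masks, doa_logits
```

In this sketch the encoder weights are trained directly on the supervised targets, mirroring the abstract's point that hidden representations are learned in a supervised manner rather than taken from self-supervised pretraining.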
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features.
arXiv Detail & Related papers (2024-06-07T13:35:44Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Mixture Encoder Supporting Continuous Speech Separation for Meeting
Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z) - MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware
Beamforming Network for Speech Separation [55.533789120204055]
We propose an end-to-end beamforming network for direction-guided speech separation given merely the mixture signal.
Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival-based embeddings and beamforming weights for each source.
arXiv Detail & Related papers (2022-12-07T01:52:40Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - LMR-CBT: Learning Modality-fused Representations with CB-Transformer for
Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z) - Learning from Heterogeneous EEG Signals with Differentiable Channel
Reordering [51.633889765162685]
CHARM is a method for training a single neural network across inconsistent input channels.
We perform experiments on four EEG classification datasets and demonstrate the efficacy of CHARM.
arXiv Detail & Related papers (2020-10-21T12:32:34Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Translate Reverberated Speech to Anechoic Ones: Speech Dereverberation
with BERT [6.876734825043823]
Single channel speech dereverberation is considered in this work.
Inspired by the recent success of the Bidirectional Encoder Representations from Transformers (BERT) model in the domain of Natural Language Processing (NLP), we investigate its applicability as a backbone sequence model to enhance reverberated speech signals.
arXiv Detail & Related papers (2020-07-16T00:45:27Z) - End-to-End Whisper to Natural Speech Conversion using Modified
Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.