BERT for Joint Multichannel Speech Dereverberation with Spatial-aware
Tasks
- URL: http://arxiv.org/abs/2010.10892v2
- Date: Thu, 22 Oct 2020 02:41:39 GMT
- Title: BERT for Joint Multichannel Speech Dereverberation with Spatial-aware
Tasks
- Authors: Yang Jiao
- Abstract summary: We propose a method for joint multichannel speech dereverberation with two spatial-aware tasks.
The proposed method addresses the involved tasks as a sequence-to-sequence mapping problem.
- Score: 6.876734825043823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method for joint multichannel speech dereverberation with two
spatial-aware tasks: direction-of-arrival (DOA) estimation and speech
separation. The proposed method addresses the involved tasks as a
sequence-to-sequence mapping problem, which is general enough for a variety of
front-end speech enhancement tasks. The proposed method is inspired by the
excellent sequence modeling capability of Bidirectional Encoder Representations
from Transformers (BERT). Instead of utilizing explicit representations from
pretraining in a self-supervised manner, we utilize transformer-encoded hidden
representations in a supervised manner. Both multichannel spectral magnitude
and spectral phase information of varying-length utterances are encoded.
Experimental results demonstrate the effectiveness of the proposed method.
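The abstract frames the joint tasks as a supervised sequence-to-sequence mapping in which a BERT-style transformer encoder consumes multichannel spectral magnitude and phase features and shared hidden representations feed the individual tasks. The following is a minimal PyTorch sketch of that idea only; the layer sizes, head structure, DOA discretization, and utterance-level pooling are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MultiTaskSpeechEncoder(nn.Module):
    """Sketch of a transformer-encoder front end for joint dereverberation,
    speech separation, and DOA estimation (hypothetical dimensions)."""

    def __init__(self, n_channels=4, n_freq=257, n_speakers=2,
                 n_doa_classes=36, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        # Each frame stacks magnitude and phase from all microphones.
        in_dim = 2 * n_channels * n_freq
        self.input_proj = nn.Linear(in_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Task heads share the transformer-encoded hidden representation.
        self.dereverb_head = nn.Linear(d_model, n_freq)                 # enhanced magnitude
        self.separation_head = nn.Linear(d_model, n_speakers * n_freq)  # per-speaker masks
        self.doa_head = nn.Linear(d_model, n_doa_classes)               # DOA class logits

    def forward(self, mag, phase, pad_mask=None):
        # mag, phase: (batch, frames, channels, freq); varying-length
        # utterances are handled with a boolean padding mask (batch, frames).
        B, T, C, F = mag.shape
        x = torch.cat([mag, phase], dim=-1).reshape(B, T, 2 * C * F)
        h = self.encoder(self.input_proj(x), src_key_padding_mask=pad_mask)
        dereverb = self.dereverb_head(h)                    # (B, T, F)
        masks = torch.sigmoid(self.separation_head(h))      # (B, T, S*F)
        masks = masks.view(B, T, -1, F)                     # (B, T, S, F)
        doa_logits = self.doa_head(h.mean(dim=1))           # utterance-level DOA
        return dereverb, masks, doa_logits
```

In this sketch the encoder weights are trained directly on the supervised targets, mirroring the abstract's point that hidden representations are learned in a supervised manner rather than taken from self-supervised pretraining.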
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - MA-AVT: Modality Alignment for Parameter-Efficient Audio-Visual Transformers [41.54004590821323]
We propose MA-AVT, a new parameter-efficient audio-visual transformer employing deep modality alignment for multimodal semantic features.
Specifically, we introduce joint unimodal and multimodal token learning for aligning the two modalities with a frozen modality-shared transformer.
Unlike prior work that only aligns coarse features from the output of unimodal encoders, we introduce blockwise contrastive learning to align coarse-to-fine-grain hierarchical features.
arXiv Detail & Related papers (2024-06-07T13:35:44Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Mixture Encoder Supporting Continuous Speech Separation for Meeting
Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by the speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z) - MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware
Beamforming Network for Speech Separation [55.533789120204055]
We propose an end-to-end beamforming network for direction-guided speech separation given merely the mixture signal.
Specifically, we design a multi-channel input and multiple outputs architecture to predict the direction-of-arrival-based embeddings and beamforming weights for each source.
arXiv Detail & Related papers (2022-12-07T01:52:40Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - LMR-CBT: Learning Modality-fused Representations with CB-Transformer for
Multimodal Emotion Recognition from Unaligned Multimodal Sequences [5.570499497432848]
We propose an efficient neural network to learn modality-fused representations with CB-Transformer (LMR-CBT) for multimodal emotion recognition.
We conduct word-aligned and unaligned experiments on three challenging datasets.
arXiv Detail & Related papers (2021-12-03T03:43:18Z) - Learning from Heterogeneous EEG Signals with Differentiable Channel
Reordering [51.633889765162685]
CHARM is a method for training a single neural network across inconsistent input channels.
We perform experiments on four EEG classification datasets and demonstrate the efficacy of CHARM.
arXiv Detail & Related papers (2020-10-21T12:32:34Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Translate Reverberated Speech to Anechoic Ones: Speech Dereverberation
with BERT [6.876734825043823]
Single channel speech dereverberation is considered in this work.
Inspired by the recent success of the Bidirectional Encoder Representations from Transformers (BERT) model in the domain of Natural Language Processing (NLP), we investigate its applicability as a backbone sequence model to enhance reverberated speech signals.
arXiv Detail & Related papers (2020-07-16T00:45:27Z) - End-to-End Whisper to Natural Speech Conversion using Modified
Transformer Network [0.8399688944263843]
We introduce whisper-to-natural-speech conversion using a sequence-to-sequence approach.
We investigate different features like Mel frequency cepstral coefficients and smoothed spectral features.
The proposed networks are trained end-to-end using a supervised approach for feature-to-feature transformation.
arXiv Detail & Related papers (2020-04-20T14:47:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.