Attention-based Neural Beamforming Layers for Multi-channel Speech
Recognition
- URL: http://arxiv.org/abs/2105.05920v1
- Date: Wed, 12 May 2021 19:32:24 GMT
- Title: Attention-based Neural Beamforming Layers for Multi-channel Speech
Recognition
- Authors: Bhargav Pulugundla, Yang Gao, Brian King, Gokce Keskin, Harish
Mallidi, Minhua Wu, Jasha Droppo, Roland Maas
- Abstract summary: We propose a 2D Conv-Attention module which combines convolutional neural networks with attention for beamforming.
We apply self- and cross-attention to explicitly model the correlations within and between the input channels.
The results show a relative improvement of 3.8% in WER by the proposed model over the baseline neural beamformer.
- Score: 17.009051842682677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based beamformers have recently been shown to be effective for
multi-channel speech recognition. However, they are less capable of capturing
local information. In this work, we propose a 2D Conv-Attention module which
combines convolutional neural networks with attention for beamforming. We apply
self- and cross-attention to explicitly model the correlations within and
between the input channels. The end-to-end 2D Conv-Attention model is compared
with multi-head self-attention and superdirective-based neural beamformers.
We train and evaluate on an in-house multi-channel dataset. The results show a
relative improvement of 3.8% in WER by the proposed model over the baseline
neural beamformer.
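As a rough illustration of the architecture described above, the following PyTorch sketch applies a 2D convolution to each input channel, uses self-attention to model correlations within a channel, and uses cross-attention from a reference channel to the remaining channels. The layer sizes, the choice of channel 0 as the reference, and the way the single enhanced output stream is formed are assumptions for illustration, not the paper's specification.

```python
import torch
import torch.nn as nn

class ConvAttentionBeamformer(nn.Module):
    """Illustrative 2D conv + self/cross-attention beamforming front-end.

    Input: spectral features of shape (batch, channels, time, freq).
    All sizes and the output combination are assumptions, not the paper's design.
    """

    def __init__(self, n_freq=257, d_model=256, n_heads=4):
        super().__init__()
        # 2D convolution applied per channel to capture local time-frequency patterns
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )
        self.proj = nn.Linear(n_freq, d_model)
        # Self-attention models correlations within a channel,
        # cross-attention models correlations between channels.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, n_freq)

    def forward(self, x):
        b, c, t, f = x.shape
        # Per-channel 2D conv over the time-frequency plane
        h = self.conv(x.reshape(b * c, 1, t, f)).reshape(b, c, t, f)
        h = self.proj(h)                                   # (b, c, t, d_model)

        # Self-attention within each channel, over time
        h_flat = h.reshape(b * c, t, -1)
        h_self, _ = self.self_attn(h_flat, h_flat, h_flat)
        h_self = h_self.reshape(b, c, t, -1)

        # Cross-attention: reference channel 0 attends to the remaining channels
        q = h_self[:, 0]                                   # (b, t, d_model)
        kv = h_self[:, 1:].reshape(b, (c - 1) * t, -1)     # keys/values from other channels
        fused, _ = self.cross_attn(q, kv, kv)

        return self.out(fused)                             # enhanced single stream (b, t, n_freq)


# Usage: a 4-channel batch of 100 frames with 257 frequency bins
feats = torch.randn(2, 4, 100, 257)
enhanced = ConvAttentionBeamformer()(feats)
print(enhanced.shape)  # torch.Size([2, 100, 257])
```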
Related papers
- Self-Supervised Learning for Multi-Channel Neural Transducer [3.045851438458641]
We explore a self-supervised learning method for a multi-channel end-to-end ASR model based on the wav2vec 2.0 framework.
We observed a 66% relative reduction in character error rate compared with a model without any pre-training on a far-field in-house dataset.
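As a rough sketch of the wav2vec 2.0-style pre-training signal this relies on, the helper below masks random spans of encoder frames that the model would then have to distinguish under a contrastive objective; the masking probability, span length, and zero-filling (rather than a learned mask embedding) are generic defaults, not settings from this paper.

```python
import torch

def mask_time_steps(features, mask_prob=0.065, span_length=10):
    """Mask random spans of frames, wav2vec 2.0-style: sampled start frames are
    expanded into fixed-length spans. The probability, span length, and zeroing
    (instead of a learned mask embedding) are illustrative defaults only."""
    batch, frames, _ = features.shape
    starts = torch.rand(batch, frames) < mask_prob            # candidate span starts
    mask = torch.zeros(batch, frames, dtype=torch.bool)
    for offset in range(span_length):
        mask[:, offset:] |= starts[:, : frames - offset]      # expand each start into a span
    masked = features.clone()
    masked[mask] = 0.0
    return masked, mask


# Usage: mask a batch of 2 utterances, 200 frames, 512-dim encoder features
masked_feats, mask = mask_time_steps(torch.randn(2, 200, 512))
```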
arXiv Detail & Related papers (2024-08-06T04:12:31Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that devotes the main training parameters to multiple cross-modal attention layers.
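A minimal sketch of the kind of audio-guided cross-modal attention such a fusion encoder stacks is shown below: audio features act as queries over the visual (lip) features. The dimensions and the residual-plus-LayerNorm arrangement are assumptions, not the CMFE's exact design.

```python
import torch
import torch.nn as nn

class AudioGuidedFusionLayer(nn.Module):
    """Illustrative audio-guided cross-modal attention: audio features act as
    queries over visual (lip) features. Dimensions and the residual/LayerNorm
    arrangement are assumptions, not the CMFE's exact design."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio, visual):
        # Each audio frame attends over the visual sequence
        fused, _ = self.cross_attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)   # residual keeps the audio stream dominant


# Usage: 100 audio frames fused with 25 video frames
layer = AudioGuidedFusionLayer()
out = layer(torch.randn(2, 100, 256), torch.randn(2, 25, 256))
```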
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- On Neural Architectures for Deep Learning-based Source Separation of Co-Channel OFDM Signals [104.11663769306566]
We study the single-channel source separation problem involving orthogonal frequency-division multiplexing (OFDM) signals.
We propose critical domain-informed modifications to the network parameterization, based on insights from OFDM structures.
arXiv Detail & Related papers (2023-03-11T16:29:13Z)
- MIMO-DBnet: Multi-channel Input and Multiple Outputs DOA-aware Beamforming Network for Speech Separation [55.533789120204055]
We propose an end-to-end beamforming network for direction-guided speech separation given only the mixture signal.
Specifically, we design a multi-channel-input, multiple-output architecture that predicts direction-of-arrival-based embeddings and beamforming weights for each source.
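The sketch below illustrates one way such a multi-channel-input, multiple-output layout could look: a shared encoder feeding one head per source that predicts a direction embedding and complex beamforming weights. The encoder choice, head structure, and all sizes are assumptions, not MIMO-DBnet's actual design.

```python
import torch
import torch.nn as nn

class PerSourceBeamformingHeads(nn.Module):
    """Illustrative multi-channel-input, multiple-output layout: a shared
    encoder feeds one head per source that predicts a direction embedding and
    complex (real/imaginary) beamforming weights. Every size and the encoder
    choice are assumptions, not MIMO-DBnet's actual design."""

    def __init__(self, n_channels=4, n_freq=257, n_sources=2, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(n_channels * n_freq, hidden, num_layers=2,
                               batch_first=True)
        self.doa_heads = nn.ModuleList(nn.Linear(hidden, 64)
                                       for _ in range(n_sources))
        self.weight_heads = nn.ModuleList(nn.Linear(hidden, 2 * n_channels * n_freq)
                                          for _ in range(n_sources))

    def forward(self, spec_mag):          # (batch, time, channels, freq) magnitudes
        b, t, c, f = spec_mag.shape
        h, _ = self.encoder(spec_mag.reshape(b, t, c * f))
        outputs = []
        for doa_head, weight_head in zip(self.doa_heads, self.weight_heads):
            weights = weight_head(h).reshape(b, t, 2, c, f)   # Re/Im per channel-bin
            outputs.append({"doa_embedding": doa_head(h), "bf_weights": weights})
        return outputs


# Usage: 2 sources estimated from a 4-channel, 257-bin magnitude spectrogram
preds = PerSourceBeamformingHeads()(torch.randn(2, 100, 4, 257))
```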
arXiv Detail & Related papers (2022-12-07T01:52:40Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in the offline and online setups, respectively.
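The hybrid CTC/attention objective behind such a system is typically a weighted sum of a CTC loss on the encoder output and a cross-entropy loss on the attention decoder output, as in the sketch below; the 0.3 CTC weight and the blank/padding ids are common choices rather than this paper's reported settings, and the alignment regularization term is omitted.

```python
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(encoder_logits, decoder_logits, targets,
                              input_lengths, target_lengths,
                              ctc_weight=0.3, blank_id=0, pad_id=-100):
    """Weighted combination of a CTC loss on the encoder output and a
    cross-entropy loss on the attention decoder output. The 0.3 weight and the
    blank/padding ids are common choices, not the paper's reported settings."""
    # CTC branch expects (time, batch, vocab) log-probabilities
    log_probs = F.log_softmax(encoder_logits, dim=-1).transpose(0, 1)
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=blank_id, zero_infinity=True)

    # Attention-decoder branch: token-level cross entropy, padding ignored
    att = F.cross_entropy(decoder_logits.reshape(-1, decoder_logits.size(-1)),
                          targets.reshape(-1), ignore_index=pad_id)

    return ctc_weight * ctc + (1.0 - ctc_weight) * att


# Usage with dummy tensors: 2 utterances, 50 encoder frames, 10 target tokens, vocab 100
enc = torch.randn(2, 50, 100)
dec = torch.randn(2, 10, 100)
tgt = torch.randint(1, 100, (2, 10))
loss = hybrid_ctc_attention_loss(enc, dec, tgt,
                                 input_lengths=torch.tensor([50, 48]),
                                 target_lengths=torch.tensor([10, 9]))
```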
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Decoupled Mixup for Generalized Visual Recognition [71.13734761715472]
We propose a novel "Decoupled-Mixup" method to train CNN models for visual recognition.
Our method decouples each image into discriminative and noise-prone regions, and then heterogeneously combines these regions to train CNN models.
Experimental results show that our method generalizes well to test data composed of unseen contexts.
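A heavily simplified sketch of the idea is given below: the two kinds of regions are mixed with different ratios under a region mask. How that mask is obtained and how the soft label should be formed are assumptions here, not the paper's exact procedure.

```python
import torch

def decoupled_mixup(x1, y1, x2, y2, region_mask, lam_disc=0.8, lam_noise=0.5):
    """Illustrative sketch of mixing discriminative and noise-prone regions with
    different ratios. How `region_mask` (1 = discriminative, 0 = noise-prone) is
    obtained, and both mixing ratios, are assumptions rather than the paper's
    exact procedure."""
    mixed = region_mask * (lam_disc * x1 + (1.0 - lam_disc) * x2) \
        + (1.0 - region_mask) * (lam_noise * x1 + (1.0 - lam_noise) * x2)
    # Soft label follows the ratio applied to the discriminative regions
    mixed_label = lam_disc * y1 + (1.0 - lam_disc) * y2
    return mixed, mixed_label


# Usage: mix two image batches with a hypothetical saliency-derived mask
x1, x2 = torch.rand(8, 3, 224, 224), torch.rand(8, 3, 224, 224)
y1, y2 = torch.eye(10)[torch.randint(10, (8,))], torch.eye(10)[torch.randint(10, (8,))]
mask = (torch.rand(8, 1, 224, 224) > 0.5).float()
images, labels = decoupled_mixup(x1, y1, x2, y2, mask)
```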
arXiv Detail & Related papers (2022-10-26T15:21:39Z)
- Improving Neural Predictivity in the Visual Cortex with Gated Recurrent Connections [0.0]
We aim to shift the focus to architectures that take into account lateral recurrent connections, a ubiquitous feature of the ventral visual stream, to devise adaptive receptive fields.
In order to increase the robustness of our approach and the biological fidelity of the activations, we employ specific data augmentation techniques.
arXiv Detail & Related papers (2022-03-22T17:27:22Z)
- Three-class Overlapped Speech Detection using a Convolutional Recurrent Neural Network [32.59704287230343]
The proposed approach classifies speech activity into three classes: non-speech, single-speaker speech, and overlapped speech.
A convolutional recurrent neural network architecture is explored to benefit from both the convolutional layers' capability to model local patterns and the recurrent layers' ability to model sequential information.
The proposed overlapped speech detection model establishes state-of-the-art performance with a precision of 0.6648 and a recall of 0.3222 on the DIHARD II evaluation set.
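A minimal sketch of such a convolutional recurrent classifier is shown below, producing per-frame logits over the three classes; the layer sizes and the log-mel front-end are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CRNNOverlapDetector(nn.Module):
    """Illustrative convolutional recurrent network that labels every frame as
    non-speech, single-speaker speech, or overlapped speech. Layer sizes and the
    feature front-end are assumptions, not the paper's exact configuration."""

    def __init__(self, n_mels=64, hidden=128, n_classes=3):
        super().__init__()
        # Convolutional block models local spectro-temporal patterns
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),                        # pool over frequency only
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 2)),
        )
        # Recurrent block models longer-range sequential context
        self.rnn = nn.GRU(32 * (n_mels // 4), hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, logmel):               # (batch, time, n_mels)
        h = self.conv(logmel.unsqueeze(1))   # (batch, 32, time, n_mels // 4)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
        h, _ = self.rnn(h)
        return self.classifier(h)            # per-frame logits over the 3 classes


# Usage: a batch of 2 clips, 300 frames, 64 mel bins
logits = CRNNOverlapDetector()(torch.randn(2, 300, 64))   # shape (2, 300, 3)
```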
arXiv Detail & Related papers (2021-04-07T03:01:34Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech from a silent video.
The proposed model combines recurrent neural networks and variational deep generative models to learn the conditional distribution of the auditory signal.
We demonstrate the performance of our model on the GRID dataset using standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)