Robust Multi-channel Speech Recognition using Frequency Aligned Network
- URL: http://arxiv.org/abs/2002.02520v1
- Date: Thu, 6 Feb 2020 21:47:39 GMT
- Title: Robust Multi-channel Speech Recognition using Frequency Aligned Network
- Authors: Taejin Park, Kenichi Kumatani, Minhua Wu, Shiva Sundaram
- Abstract summary: We use a frequency aligned network for robust automatic speech recognition.
We show that our multi-channel acoustic model with a frequency aligned network achieves up to an 18% relative reduction in word error rate.
- Score: 23.397670239950187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional speech enhancement techniques such as beamforming have known
benefits for far-field speech recognition. Our own work in frequency-domain
multi-channel acoustic modeling has shown additional improvements by training a
spatial filtering layer jointly within an acoustic model. In this paper, we
further develop this idea and use a frequency aligned network for robust
multi-channel automatic speech recognition (ASR). Unlike an affine layer in the
frequency domain, the proposed frequency aligned component prevents one
frequency bin from influencing other frequency bins. We show that this
modification not only reduces the number of parameters in the model but also
significantly improves ASR performance. We investigate the effects of the
frequency aligned network through ASR experiments on real-world far-field data
where users interact with an ASR system in uncontrolled acoustic environments.
We show that our multi-channel acoustic model with a frequency aligned network
yields up to an 18% relative reduction in word error rate.
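The contrast between an affine layer over frequencies and a frequency aligned component can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tensor shapes, real-valued features, and layer sizes are assumptions (the actual model operates on complex multi-channel STFT features).
```python
import torch

n_bins, n_ch = 257, 4  # assumed FFT bin and microphone counts

# Full affine layer over the stacked frequency axis: every output bin
# can depend on every input bin (n_bins * n_ch inputs -> n_bins outputs).
affine = torch.nn.Linear(n_bins * n_ch, n_bins)

# Frequency aligned alternative: one small weight vector per bin, so output
# bin k is computed only from bin k of the input channels.
per_bin_w = torch.nn.Parameter(torch.randn(n_bins, n_ch))

x = torch.randn(8, n_ch, n_bins)            # (batch, channel, frequency)
y_affine = affine(x.reshape(8, -1))         # frequency bins mix freely
y_aligned = (x * per_bin_w.t()).sum(dim=1)  # each bin stays in its lane

print(y_affine.shape, y_aligned.shape)      # both torch.Size([8, 257])
# Parameter counts: the affine layer needs n_bins*n_ch*n_bins + n_bins
# weights; the aligned layer needs only n_bins*n_ch.
print(sum(p.numel() for p in affine.parameters()), per_bin_w.numel())
```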
Related papers
- Accelerating Inference of Networks in the Frequency Domain [8.125023712173686]
We propose performing network inference in the frequency domain to speed up networks whose frequency parameters are sparse.
In particular, we propose a frequency inference chain that is dual to the network inference in the spatial domain.
The proposed approach significantly improves accuracy in the case of a high speedup ratio (over 100x).
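The enabling identity behind frequency-domain inference is the convolution theorem: convolution in the spatial domain becomes a pointwise product of frequency coefficients. A minimal sketch of that identity on a small grid (the paper's frequency inference chain keeps whole layers in the frequency domain and is more involved):
```python
import numpy as np

# Circular 2D convolution done two ways: directly in the spatial domain and
# as a pointwise product in the frequency domain (the convolution theorem).
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))   # input "feature map"
k = rng.standard_normal((32, 32))   # kernel, same size as the input

# Frequency-domain route: one multiply per frequency coefficient.
y_freq = np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)).real

# Spatial-domain reference: explicit circular convolution.
y_spatial = np.zeros_like(x)
for i in range(32):
    for j in range(32):
        y_spatial += x[i, j] * np.roll(np.roll(k, i, axis=0), j, axis=1)

print(np.allclose(y_freq, y_spatial))  # True
```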
arXiv Detail & Related papers (2024-10-06T03:34:38Z)
- Frequency-Aware Deepfake Detection: Improving Generalizability through Frequency Space Learning [81.98675881423131]
This research addresses the challenge of developing a universal deepfake detector that can effectively identify unseen deepfake images.
Existing frequency-based paradigms have relied on frequency-level artifacts introduced during the up-sampling in GAN pipelines to detect forgeries.
We introduce a novel frequency-aware approach called FreqNet, centered around frequency domain learning, specifically designed to enhance the generalizability of deepfake detectors.
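FreqNet itself learns in the frequency domain; as a rough, hand-crafted stand-in for the kind of frequency-level input it exploits, one can expose the high-frequency residual where up-sampling artifacts tend to concentrate. The function name and cutoff below are illustrative, not from the paper.
```python
import numpy as np

def high_freq_component(img: np.ndarray, cutoff: int = 8) -> np.ndarray:
    """Zero out the low frequencies of a grayscale image and return the
    high-frequency residual, where up-sampling artifacts tend to live."""
    spec = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    cy, cx = h // 2, w // 2
    spec[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff] = 0  # drop low band
    return np.fft.ifft2(np.fft.ifftshift(spec)).real

img = np.random.rand(64, 64)           # stand-in for a face crop
print(high_freq_component(img).shape)  # (64, 64)
```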
arXiv Detail & Related papers (2024-03-12T01:28:00Z)
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
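A minimal sketch of the time-frequency interface this class of methods works with: compute complex STFT bins, apply a mask per source, and invert. The random mask below is a placeholder standing in for RTFS-Net's actual separation algorithm.
```python
import torch

wav = torch.randn(1, 16000)  # 1 s of audio at an assumed 16 kHz

# Complex time-frequency bins, as consumed by a T-F domain separator.
spec = torch.stft(wav, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)
print(spec.shape, spec.dtype)  # torch.Size([1, 257, 126]) torch.complex64

# A separator would predict a mask per source; applying it to the complex
# bins and inverting the STFT recovers a waveform estimate.
mask = torch.sigmoid(torch.randn_like(spec.real))  # placeholder mask
est = torch.istft(spec * mask, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), length=wav.shape[-1])
print(est.shape)  # torch.Size([1, 16000])
```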
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Speech enhancement with frequency domain auto-regressive modeling [34.55703785405481]
Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation.
We propose a unified framework of speech dereverberation for improving the speech quality and the automatic speech recognition (ASR) performance.
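The summary does not detail the autoregressive model; as a rough illustration of AR processing of subband signals, here is a classical per-frequency-bin linear-prediction dereverberation sketch (a simplified, unweighted relative of WPE, not the paper's method).
```python
import numpy as np

def subband_lp_dereverb(stft, taps=10, delay=3):
    """Per-frequency-bin linear prediction: predict each frame from delayed
    past frames and subtract the prediction as a late-reverberation proxy."""
    bins, frames = stft.shape
    out = stft.copy()
    for f in range(bins):
        x = stft[f]
        rows = [x[t - delay - taps + 1:t - delay + 1][::-1]
                for t in range(delay + taps, frames)]
        X = np.array(rows)                        # (frames - delay - taps, taps)
        y = x[delay + taps:]
        g = np.linalg.lstsq(X, y, rcond=None)[0]  # per-bin AR coefficients
        out[f, delay + taps:] = y - X @ g         # remove the predicted tail
    return out

spec = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
print(subband_lp_dereverb(spec).shape)  # (257, 100)
```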
arXiv Detail & Related papers (2023-09-24T03:25:51Z)
- Adaptive Frequency Learning in Two-branch Face Forgery Detection [66.91715092251258]
We propose to Adaptively learn Frequency information in a two-branch Detection framework, dubbed AFD.
We liberate our network from the fixed frequency transforms, and achieve better performance with our data- and task-dependent transform layers.
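One common way to make a frequency transform data- and task-dependent, in the spirit of this summary, is to replace a fixed transform with a learnable linear layer initialized from it. A sketch assuming a DCT-II initializer (the paper's transform layers may differ):
```python
import numpy as np
import torch

n = 64
# Orthonormal DCT-II basis matrix, the usual fixed frequency transform.
k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
dct_mat = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
dct_mat[0] /= np.sqrt(2)

# Learnable transform initialized at the DCT: training can then adapt it
# to the data and task instead of keeping it fixed.
transform = torch.nn.Linear(n, n, bias=False)
transform.weight.data = torch.tensor(dct_mat, dtype=torch.float32)

x = torch.randn(8, n)
print(transform(x).shape)  # torch.Size([8, 64])
```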
arXiv Detail & Related papers (2022-03-27T14:25:52Z)
- Three-Way Deep Neural Network for Radio Frequency Map Generation and Source Localization [67.93423427193055]
Monitoring wireless spectrum over spatial, temporal, and frequency domains will become a critical feature in beyond-5G and 6G communication technologies.
In this paper, we present a Generative Adversarial Network (GAN) machine learning model to interpolate irregularly distributed measurements across the spatial domain.
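A rough sketch of the generator-side interface such a model might use: a sparse measurement grid plus an observation mask in, a dense map estimate out. The paper's three-way architecture and discriminator are omitted here.
```python
import torch

# Toy generator for map interpolation: input is a sparse measurement grid
# plus a mask of observed cells; output is a dense RF map estimate.
gen = torch.nn.Sequential(
    torch.nn.Conv2d(2, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 32, 3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 1, 3, padding=1),
)
sparse = torch.randn(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) < 0.05).float()  # ~5% of cells observed
dense_est = gen(torch.cat([sparse * mask, mask], dim=1))
print(dense_est.shape)  # torch.Size([1, 1, 64, 64])
```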
arXiv Detail & Related papers (2021-11-23T22:25:10Z)
- Multi-Channel End-to-End Neural Diarization with Distributed Microphones [53.99406868339701]
We replace Transformer encoders in EEND with two types of encoders that process a multi-channel input.
We also propose a model adaptation method using only single-channel recordings.
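One simple way an encoder can process a multi-channel input while staying agnostic to the channel count is attention-based fusion over channels; the paper's encoders are more elaborate, so treat this as an illustrative sketch only.
```python
import torch

class CrossChannelFusion(torch.nn.Module):
    """Fuse per-channel frame embeddings with attention over channels,
    so the model stays agnostic to the number of microphones."""
    def __init__(self, dim):
        super().__init__()
        self.score = torch.nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, channels, time, dim)
        w = torch.softmax(self.score(x), dim=1)  # attention over channels
        return (w * x).sum(dim=1)                # (batch, time, dim)

x = torch.randn(2, 6, 50, 128)           # works for any channel count
print(CrossChannelFusion(128)(x).shape)  # torch.Size([2, 50, 128])
```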
arXiv Detail & Related papers (2021-10-10T03:24:03Z)
- ChannelAugment: Improving generalization of multi-channel ASR by training with input channel randomization [6.42706307642403]
End-to-end (E2E) multi-channel ASR systems show state-of-the-art performance in far-field ASR tasks.
The main limitation of such systems is that they are usually trained with data from a fixed array geometry.
We present a simple and effective data augmentation technique, which is based on randomly dropping channels in the multi-channel audio input during training.
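A minimal sketch of this augmentation, assuming dropped channels are zeroed out (whether the paper zeroes or removes channels is a detail not stated in the summary):
```python
import torch

def channel_augment(batch: torch.Tensor, keep_min: int = 1) -> torch.Tensor:
    """Randomly zero out microphone channels during training.
    batch: (batch, channels, time). Always keeps at least keep_min channels."""
    b, c, _ = batch.shape
    keep = torch.randint(keep_min, c + 1, (b,))  # channels kept per example
    mask = torch.zeros(b, c)
    for i in range(b):
        idx = torch.randperm(c)[: keep[i]]
        mask[i, idx] = 1.0
    return batch * mask.unsqueeze(-1)

x = torch.randn(4, 7, 16000)     # 7-mic array, raw waveforms
print(channel_augment(x).shape)  # torch.Size([4, 7, 16000])
```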
arXiv Detail & Related papers (2021-09-23T09:13:47Z)
- Enhancement of Spatial Clustering-Based Time-Frequency Masks using LSTM Neural Networks [3.730592618611028]
We use LSTMs to enhance spatial clustering based time-frequency masks.
We achieve both the signal modeling performance of multiple single-channel LSTM-DNN speech enhancers and the signal separation performance of multi-channel spatial clustering.
We evaluate the intelligibility of the output of each system using word error rate from a Kaldi automatic speech recognizer.
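A minimal sketch of the enhancement stage: an LSTM that takes an initial clustering-based mask and emits a refined mask. The bin count, hidden size, and inputs are assumptions (the paper's system may also consume spectral features).
```python
import torch

class MaskEnhancer(torch.nn.Module):
    """LSTM that refines an initial time-frequency mask (e.g. from spatial
    clustering) frame by frame. Input/output: (batch, time, freq_bins)."""
    def __init__(self, n_bins: int = 257, hidden: int = 256):
        super().__init__()
        self.lstm = torch.nn.LSTM(n_bins, hidden, batch_first=True)
        self.out = torch.nn.Linear(hidden, n_bins)

    def forward(self, rough_mask):
        h, _ = self.lstm(rough_mask)
        return torch.sigmoid(self.out(h))  # refined mask in [0, 1]

rough = torch.rand(2, 100, 257)     # clustering-based mask estimate
print(MaskEnhancer()(rough).shape)  # torch.Size([2, 100, 257])
```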
arXiv Detail & Related papers (2020-12-02T22:29:29Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% absolute (26.83% relative) and 22.22% absolute (56.87% relative) word error rate (WER) reduction on overlapped speech constructed using simulation or replaying of the Lip Reading Sentences 2 dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
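A toy sketch of the direction-informed idea: encode the multi-channel mixture, inject an embedding of the target direction, and decode a waveform. All layer choices here are illustrative, not the paper's architecture.
```python
import torch

class DirectionInformedFilter(torch.nn.Module):
    """Toy direction-informed extractor: fuse multi-channel features with an
    embedding of the target direction, then regress the target waveform."""
    def __init__(self, n_ch=4, feat=256):
        super().__init__()
        self.enc = torch.nn.Conv1d(n_ch, feat, kernel_size=16, stride=8)
        self.dir_emb = torch.nn.Linear(1, feat)  # target DOA in radians
        self.dec = torch.nn.ConvTranspose1d(feat, 1, kernel_size=16, stride=8)

    def forward(self, mix, doa):
        h = self.enc(mix)                        # (batch, feat, frames)
        h = h + self.dir_emb(doa).unsqueeze(-1)  # inject the direction cue
        return self.dec(h)                       # (batch, 1, samples)

mix = torch.randn(2, 4, 16000)                   # 4-mic mixtures
doa = torch.tensor([[0.5], [1.2]])               # target directions
print(DirectionInformedFilter()(mix, doa).shape) # torch.Size([2, 1, 16000])
```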
arXiv Detail & Related papers (2020-01-02T11:12:50Z)