Multi-View Frequency-Attention Alternative to CNN Frontends for
Automatic Speech Recognition
- URL: http://arxiv.org/abs/2306.06954v1
- Date: Mon, 12 Jun 2023 08:37:36 GMT
- Title: Multi-View Frequency-Attention Alternative to CNN Frontends for
Automatic Speech Recognition
- Authors: Belen Alastruey, Lukas Drude, Jahn Heymann, Simon Wiesler
- Abstract summary: We show that global attention over frequencies is beneficial over local convolution.
We obtain a 2.4% relative word error rate reduction on a production-scale Conformer transducer by replacing its convolutional neural network frontend.
- Score: 12.980843126905203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional frontends are a typical choice for Transformer-based automatic
speech recognition to preprocess the spectrogram, reduce its sequence length,
and combine local information in time and frequency similarly. However, the
width and height of an audio spectrogram denote different information, e.g.,
due to reverberation as well as the articulatory system, the time axis has a
clear left-to-right dependency. On the contrary, vowels and consonants
demonstrate very different patterns and occupy almost disjoint frequency
ranges. Therefore, we hypothesize that global attention over frequencies is
beneficial over local convolution. We obtain a 2.4% relative word error rate
reduction (rWERR) on a production-scale Conformer transducer by replacing its
convolutional neural network frontend with the proposed F-Attention module on
Alexa traffic. To demonstrate generalizability, we validate this on public
LibriSpeech data with a long short-term memory-based listen, attend and spell
architecture, obtaining 4.6% rWERR, and demonstrate robustness to (simulated)
noisy conditions.
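To make the core idea concrete, the following is a minimal sketch of a frequency-attention frontend: self-attention is applied globally across the frequency bins of each spectrogram frame, and the result is projected and subsampled in time to mimic the sequence-length reduction of a convolutional frontend. The class name, dimensions, and single-view design are illustrative assumptions, not the authors' exact multi-view F-Attention module.

```python
# Minimal sketch (not the authors' exact F-Attention design): global
# self-attention over the frequency axis of each frame, followed by a
# projection and time subsampling in place of a CNN frontend.
import torch
import torch.nn as nn


class FrequencyAttentionFrontend(nn.Module):
    def __init__(self, num_freq_bins=80, d_model=256, num_heads=4, time_stride=4):
        super().__init__()
        self.freq_embed = nn.Linear(1, d_model)            # lift each bin to d_model
        self.freq_pos = nn.Parameter(0.02 * torch.randn(num_freq_bins, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.proj = nn.Linear(num_freq_bins * d_model, d_model)
        self.time_stride = time_stride                     # mimics conv subsampling

    def forward(self, spec):                               # spec: (batch, time, freq)
        b, t, f = spec.shape
        x = spec.reshape(b * t, f, 1)                      # each frame's bins as a sequence
        x = self.freq_embed(x) + self.freq_pos             # (b*t, freq, d_model)
        x, _ = self.attn(x, x, x)                          # global attention over frequencies
        x = x.reshape(b, t, -1)                            # (b, t, freq * d_model)
        x = self.proj(x)                                   # (b, t, d_model)
        return x[:, :: self.time_stride, :]                # reduce sequence length


frontend = FrequencyAttentionFrontend()
log_mel = torch.randn(2, 400, 80)                          # (batch, frames, mel bins)
out = frontend(log_mel)                                    # -> (2, 100, 256)
```

In the paper's setting, such a module sits in front of a Conformer transducer (or an LSTM-based listen, attend and spell model); the multi-view combination described in the title is not reproduced in this sketch.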
Related papers
- RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms to the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z)
- Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning [36.352159541825095]
We propose FreqAug, a frequency-domain data augmentation for video representation learning.
FreqAug pushes the model to focus more on dynamic features in the video by dropping spatial or temporal low-frequency components.
To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations.
arXiv Detail & Related papers (2022-04-08T06:19:32Z)
- Single microphone speaker extraction using unified time-frequency Siamese-Unet [22.224446472612197]
We propose a Siamese-Unet architecture that uses both the time-domain and frequency-domain representations.
Siamese encoders are applied in the frequency-domain to infer the embedding of the noisy and reference spectra.
The model is trained with the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) loss to exploit the time-domain information.
arXiv Detail & Related papers (2022-03-06T11:45:30Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations.
The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals [7.219077740523682]
We introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing audio data.
We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using 4 datasets of different sizes.
arXiv Detail & Related papers (2020-03-06T12:28:04Z)
- Robust Multi-channel Speech Recognition using Frequency Aligned Network [23.397670239950187]
We use a frequency-aligned network for robust automatic speech recognition.
We show that our multi-channel acoustic model with a frequency-aligned network achieves up to an 18% relative reduction in word error rate.
arXiv Detail & Related papers (2020-02-06T21:47:39Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture (a generic sketch of direction-informed spatial features follows this entry).
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
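The summary above does not specify how direction information enters the temporal-spatial neural filter. As a generic, hedged illustration of the kind of direction-informed spatial feature commonly fed to multi-channel target speech separators, the sketch below computes inter-channel phase differences (IPDs) from a multi-channel STFT; the function name, microphone pairs, and STFT settings are illustrative assumptions, not the cited paper's method.

```python
# Generic illustration (not the cited paper's method): inter-channel
# phase-difference (IPD) features, a common direction-informed input for
# multi-channel target speech separation.
import torch


def ipd_features(wave, mic_pairs=((0, 1), (0, 2)), n_fft=512, hop=256):
    """wave: (channels, samples) multi-channel time-domain signal."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)   # (ch, freq, frames)
    phase = torch.angle(spec)
    feats = []
    for i, j in mic_pairs:
        ipd = phase[i] - phase[j]                            # (freq, frames)
        feats.append(torch.cos(ipd))                         # wrap-invariant encoding
        feats.append(torch.sin(ipd))
    return torch.stack(feats, dim=0)                         # (2 * pairs, freq, frames)


# Example: 4-channel, 1-second mixture at 16 kHz
mix = torch.randn(4, 16000)
spatial = ipd_features(mix)                                  # -> (4, 257, 63)
```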