Similarity and Content-based Phonetic Self Attention for Speech
Recognition
- URL: http://arxiv.org/abs/2203.10252v1
- Date: Sat, 19 Mar 2022 05:35:26 GMT
- Title: Similarity and Content-based Phonetic Self Attention for Speech
Recognition
- Authors: Kyuhong Shim, Wonyong Sung
- Abstract summary: The proposed phonetic self-attention (phSA) is composed of two different types of phonetic attention.
We identify which parts of the original dot product correspond to the two attention patterns and improve each part with simple modifications.
Our experiments on phoneme classification and speech recognition show that replacing SA with phSA in the lower layers improves recognition performance without increasing latency or parameter size.
- Score: 16.206467862132012
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Transformer-based speech recognition models have achieved great success due
to the self-attention (SA) mechanism that utilizes every frame in the feature
extraction process. In particular, SA heads in the lower layers capture various
phonetic characteristics by the query-key dot product, which is designed to
compute the pairwise relationship between frames. In this paper, we propose a
variant of SA that extracts more representative phonetic features. The proposed
phonetic self-attention (phSA) is composed of two different types of phonetic
attention: one is similarity-based and the other is content-based. In short,
similarity-based attention utilizes the correlation between frames, while
content-based attention considers each frame on its own, unaffected by the
others. We identify which parts of the original dot product correspond to these
two attention patterns and improve each part with simple modifications. Our
experiments on phoneme classification and speech recognition show that
replacing SA with phSA in the lower layers improves recognition performance
without increasing latency or parameter size.
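The abstract's key idea can be made concrete by expanding the standard query-key dot product. With biased query/key projections, the logit for frames i and j splits into a pairwise similarity term (x_i W_Q)(x_j W_K)^T, a content term b_Q (x_j W_K)^T that depends only on frame j, and two terms that are constant across j and therefore vanish under the row-wise softmax. The single-head PyTorch sketch below implements only this decomposition; the class name, shapes, and the way the two terms are recombined are illustrative assumptions, and the paper's "simple modifications" to each term are not reproduced.

```python
# Minimal sketch, not the authors' exact formulation: standard biased
# query/key projections, with the attention logits rebuilt from the
# similarity term (pairwise) and the content term (per key frame) only.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class PhoneticSelfAttentionSketch(nn.Module):
    """Single-head sketch of similarity- plus content-based attention."""

    def __init__(self, d_model: int):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model, bias=True)
        self.w_k = nn.Linear(d_model, d_model, bias=True)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_model)
        q_content = F.linear(x, self.w_q.weight)   # query projection without its bias
        k_content = F.linear(x, self.w_k.weight)   # key projection without its bias
        b_q = self.w_q.bias                        # query bias b_Q

        # Similarity-based term: depends on the pair of frames (i, j).
        sim_logits = torch.matmul(q_content, k_content.transpose(-2, -1))

        # Content-based term: depends only on the key frame j; broadcast over i.
        content_logits = torch.matmul(k_content, b_q).unsqueeze(1)

        # The remaining terms of the expansion, (x_i W_Q) b_K^T and b_Q b_K^T,
        # are constant across j and cancel in the row-wise softmax, so they
        # are omitted here.
        attn = F.softmax((sim_logits + content_logits) * self.scale, dim=-1)
        return torch.matmul(attn, self.w_v(x))


# Illustrative usage with random features.
x = torch.randn(2, 100, 256)          # (batch, frames, d_model)
phsa = PhoneticSelfAttentionSketch(256)
print(phsa(x).shape)                  # torch.Size([2, 100, 256])
```

Per the abstract, a module of this kind would replace standard SA only in the lower encoder layers, with ordinary SA kept elsewhere.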
Related papers
- Parameter Efficient Audio Captioning With Faithful Guidance Using
Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination (a minimal similarity-check sketch appears after this list).
We then propose a parameter-efficient, inference-time faithful decoding algorithm that enables smaller audio captioning models to match the performance of larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z)
- Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
The benefit of video input is consistently demonstrated in mask-based MVDR speech separation and in DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-ends.
Experiments were conducted on overlapped and reverberant speech mixtures constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z)
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
This paper proposes Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
arXiv Detail & Related papers (2022-02-02T21:17:14Z)
- Voice Quality and Pitch Features in Transformer-Based Speech Recognition [3.921076451326107]
We study the effects of incorporating voice quality and pitch features altogether and separately to a Transformer-based ASR model.
We find mean Word Error Rate relative reductions of up to 5.6% on the LibriSpeech benchmark.
arXiv Detail & Related papers (2021-12-21T17:49:06Z)
- Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement [94.0676772764248]
We propose a visual embedding approach to improving embedding aware speech enhancement (EASE).
We first extract visual embedding from lip frames using a pre-trained phone or articulation place recognizer for visual-only EASE (VEASE).
Next, we extract audio-visual embedding from noisy speech and lip videos in an information intersection manner, utilizing the complementarity of audio and visual features for multi-modal EASE (MEASE).
arXiv Detail & Related papers (2020-09-21T01:26:19Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
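As referenced in the first related entry above (Parameter Efficient Audio Captioning With Faithful Guidance), similarity in an audio-text shared latent space can be used to flag hallucinated captions. The sketch below illustrates such a check; the shared-space encoder, threshold value, and function name are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch, assuming 1-D embeddings from some shared audio-text
# encoder (e.g. a CLAP-style model). The threshold, names, and decision
# rule are hypothetical placeholders, not the paper's implementation.
import torch
import torch.nn.functional as F


def is_hallucinated(audio_emb: torch.Tensor,
                    caption_emb: torch.Tensor,
                    threshold: float = 0.3) -> bool:
    """Flag a candidate caption whose shared-space similarity to the audio is low."""
    sim = F.cosine_similarity(audio_emb.unsqueeze(0),
                              caption_emb.unsqueeze(0)).item()
    return sim < threshold


# Dummy embeddings stand in for outputs of the shared encoder.
audio_emb = torch.randn(512)
caption_emb = torch.randn(512)
print(is_hallucinated(audio_emb, caption_emb))
```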
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.