Conformer-based End-to-end Speech Recognition With Rotary Position Embedding
- URL: http://arxiv.org/abs/2107.05907v1
- Date: Tue, 13 Jul 2021 08:07:22 GMT
- Title: Conformer-based End-to-end Speech Recognition With Rotary Position Embedding
- Authors: Shengqiang Li, Menglong Xu, Xiao-Lei Zhang
- Abstract summary: We introduce rotary position embedding (RoPE) into the convolution-augmented transformer (conformer).
RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module.
Our model achieves a relative word error rate reduction of 8.70% and 7.27% over the conformer on the test-clean and test-other sets of the LibriSpeech corpus, respectively.
- Score: 11.428057887454008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based end-to-end speech recognition models have received
considerable attention in recent years due to their high training speed and
ability to model a long-range global context. Position embedding in the
transformer architecture is indispensable because it provides supervision for
dependency modeling between elements at different positions in the input
sequence. To make use of the time order of the input sequence, many works
inject some information about the relative or absolute position of the element
into the input sequence. In this work, we investigate various position
embedding methods in the convolution-augmented transformer (conformer) and
adopt a novel implementation named rotary position embedding (RoPE). RoPE
encodes absolute positional information into the input sequence by a rotation
matrix, and then naturally incorporates explicit relative position information
into a self-attention module. To evaluate the effectiveness of the RoPE method,
we conducted experiments on AISHELL-1 and LibriSpeech corpora. Results show
that the conformer enhanced with RoPE achieves superior performance in the
speech recognition task. Specifically, our model achieves a relative word error
rate reduction of 8.70% and 7.27% over the conformer on test-clean and
test-other sets of the LibriSpeech corpus respectively.
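To make the rotation mechanism concrete, the following minimal NumPy sketch (an illustration under common conventions, not the authors' released code; the helper names rope_rotate and score and the base 10000.0 are assumptions) applies RoPE to query and key vectors and numerically checks the property stated above: the attention score between a query at position m and a key at position n depends only on the offset m - n.

```python
# Minimal NumPy sketch of rotary position embedding (RoPE). This illustrates
# the idea described in the abstract, not the paper's implementation; names
# like rope_rotate and the base 10000.0 follow common convention.
import numpy as np

def rope_rotate(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each row of x (shape: seq_len x dim, dim even) by a
    position-dependent angle, one angle per dimension pair."""
    seq_len, dim = x.shape
    assert dim % 2 == 0, "RoPE pairs up dimensions, so dim must be even"
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = np.outer(np.arange(seq_len), inv_freq)          # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                          # split into pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Check the relative-position property: the attention score between a query
# rotated to position m and a key rotated to position n depends only on the
# offset m - n, not on the absolute positions themselves.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

def score(m: int, n: int) -> float:
    length = max(m, n) + 1
    q_rot = rope_rotate(np.tile(q, (length, 1)))
    k_rot = rope_rotate(np.tile(k, (length, 1)))
    return float(q_rot[m] @ k_rot[n])

print(np.isclose(score(3, 1), score(7, 5)))  # True: both have offset 2
```

The final check prints True because, in each two-dimensional subspace, the rotations compose as R(m*theta)^T R(n*theta) = R((n - m)*theta): the absolute-position rotations cancel, leaving only the relative offset, which is how RoPE injects explicit relative position information into self-attention.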
Related papers
- Toward Relative Positional Encoding in Spiking Transformers [52.62008099390541]
Spiking neural networks (SNNs) are bio-inspired networks that model how neurons in the brain communicate through discrete spikes.
In this paper, we introduce an approximate method for relative positional encoding (RPE) in Spiking Transformers.
arXiv Detail & Related papers (2025-01-28T06:42:37Z)
- Benchmarking Rotary Position Embeddings for Automatic Speech Recognition [17.360059094663182]
Rotary Position Embedding (RoPE) encodes relative and absolute positional information in Transformer-based models.
RoPE consistently achieves lower error rates than the relative positional embedding currently in wide use.
To facilitate further research, we release the implementation and all experimental recipes through the SpeechBrain toolkit.
arXiv Detail & Related papers (2025-01-10T15:30:46Z)
- Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding [121.08841110022607]
Existing agent-centric methods have demonstrated outstanding performance on public benchmarks.
We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers.
By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
arXiv Detail & Related papers (2023-10-19T17:59:01Z)
- SPE-Net: Boosting Point Cloud Analysis via Rotation Robustness Enhancement [118.20816888815658]
We propose a novel deep architecture tailored for 3D point cloud applications, named SPE-Net.
The embedded 'Selective Position variant' procedure relies on an attention mechanism that can effectively attend to the underlying rotation condition of the input.
We demonstrate the merits of SPE-Net and the associated hypothesis on four benchmarks, showing clear improvements over SOTA methods on both rotated and unrotated test data.
arXiv Detail & Related papers (2022-11-15T15:59:09Z)
- Deep Reinforcement Learning for IRS Phase Shift Design in Spatiotemporally Correlated Environments [93.30657979626858]
We propose a deep actor-critic algorithm that accounts for channel correlations and destination motion.
We show that, when channels are temporally correlated, the inclusion of the SNR in the state representation can interact with function approximation in ways that inhibit convergence.
arXiv Detail & Related papers (2022-11-02T22:07:36Z)
- Multiplicative Position-aware Transformer Models for Language Understanding [17.476450946279037]
Transformer models, which leverage architectural improvements like self-attention, perform remarkably well on Natural Language Processing (NLP) tasks.
In this paper, we review major existing position embedding methods and compare their accuracy on downstream NLP tasks.
We also propose a novel multiplicative embedding method which leads to superior accuracy when compared to existing methods.
arXiv Detail & Related papers (2021-09-27T04:18:32Z)
- Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
arXiv Detail & Related papers (2021-07-29T17:55:10Z)
- CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings [33.87449556591022]
We propose an augmentation-based approach (CAPE) for absolute positional embeddings.
CAPE keeps the advantages of both absolute position embeddings (simplicity and speed) and relative position embeddings (better generalization).
arXiv Detail & Related papers (2021-06-06T14:54:55Z)
- RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage positional information.
RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
- Constructing interval variables via faceted Rasch measurement and multitask deep learning: a hate speech application [63.10266319378212]
We propose a method for measuring complex variables on a continuous, interval spectrum by combining supervised deep learning with the Constructing Measures approach to faceted Rasch item response theory (IRT).
We demonstrate this new method on a dataset of 50,000 social media comments sourced from YouTube, Twitter, and Reddit and labeled by 11,000 U.S.-based Amazon Mechanical Turk workers.
arXiv Detail & Related papers (2020-09-22T02:15:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.