End-to-end Speech Recognition with similar length speech and text
- URL: http://arxiv.org/abs/2510.10453v1
- Date: Sun, 12 Oct 2025 05:20:48 GMT
- Title: End-to-end Speech Recognition with similar length speech and text
- Authors: Peng Fan, Wenping Wang, Fei Deng
- Abstract summary: In previous research, various approaches have been employed to align text with speech. We introduce two methods for alignment: a) Time Independence Loss (TIL) and b) Aligned Cross Entropy (AXE) Loss. Experimental results on AISHELL-1 and AISHELL-2 subsets show that the proposed methods outperform the previous work and achieve a reduction of at least 86% in the number of frames.
- Score: 30.788306451696865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The mismatch between speech length and text length poses a challenge in automatic speech recognition (ASR). Previous research has employed various approaches to align text with speech, including Connectionist Temporal Classification (CTC). Earlier work introduced a key frame mechanism (KFDS) that uses intermediate CTC outputs to guide downsampling and preserve keyframes, but traditional methods such as CTC fail to align speech and text appropriately when speech is downsampled to a text-similar length. In this paper, we focus on speech recognition in cases where the length of the speech closely matches that of the corresponding text. To address this issue, we introduce two alignment methods: a) Time Independence Loss (TIL) and b) Aligned Cross Entropy (AXE) Loss, which is based on edit distance. To enhance the information carried by keyframes, we incorporate frame fusion, applying weights to each keyframe and its context frames and summing them. Experimental results on subsets of the AISHELL-1 and AISHELL-2 datasets show that the proposed methods outperform previous work and reduce the number of frames by at least 86%.
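The frame-fusion step in the abstract (a weighted sum of each keyframe with its neighboring context frames) can be sketched as below. Note this is an illustrative sketch only: the `fuse_keyframes` helper, the triangular weighting, and the interpretation of "context 2 frames" as two frames on each side are assumptions, not the paper's exact scheme.

```python
import numpy as np

def fuse_keyframes(frames, keyframe_idx, context=2):
    """Weighted fusion of each keyframe with `context` frames on each side.

    `frames` is a (T, D) array of frame-level features. The triangular,
    normalized weights are placeholders; the abstract does not specify
    the weighting scheme.
    """
    T, D = frames.shape
    offsets = np.arange(-context, context + 1)
    # Triangular window peaking at the keyframe, normalized to sum to 1.
    w = (context + 1 - np.abs(offsets)).astype(float)
    w /= w.sum()
    fused = []
    for k in keyframe_idx:
        idx = np.clip(k + offsets, 0, T - 1)  # clamp at sequence edges
        fused.append(w @ frames[idx])        # weighted sum over the window
    return np.stack(fused)
```

Because the assumed weights are symmetric and sum to one, an interior keyframe on a linear ramp of features fuses back to its own value, which makes the sketch easy to sanity-check.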
Related papers
- Towards Efficient Speech-Text Jointly Decoding within One Speech Language Model [76.06585781346601]
Speech language models (Speech LMs) enable end-to-end speech-text modelling within a single model. The choice of speech-text jointly decoding paradigm plays a critical role in performance, efficiency, and alignment quality.
arXiv Detail & Related papers (2025-06-04T23:53:49Z)
- Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding [0.6798775532273751]
I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds.
arXiv Detail & Related papers (2025-02-26T07:09:33Z)
- CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting [6.856101216726412]
This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment.
For every input frame, the proposed method finds the optimal alignment ending at the frame using connectionist temporal classification (CTC).
We then aggregate the frame-level acoustic embeddings (AE) to obtain higher-level (i.e., character, word, or phrase) AEs that align with the text embedding (TE) of the target keyword text.
arXiv Detail & Related papers (2024-06-12T06:44:40Z)
- Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for Meeting Transcription [31.774032625780414]
TF-GridNet has shown impressive performance in speech separation in real reverberant conditions. We extend the mixture encoder from a static two-speaker scenario to a natural meeting context. Experiments result in a new state-of-the-art performance on LibriCSS using a single microphone.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length.
We argue that consistency losses could forgive length differences and simply assume the best alignment.
arXiv Detail & Related papers (2023-08-11T13:28:48Z)
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction.
arXiv Detail & Related papers (2022-10-30T06:38:19Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Implicit Feature Alignment: Learn to Convert Text Recognizer to Text Spotter [38.4211220941874]
We propose a simple, elegant and effective paradigm called Implicit Feature Alignment (IFA).
IFA can be easily integrated into current text recognizers, resulting in a novel inference mechanism called IFA inference.
We experimentally demonstrate that IFA achieves state-of-the-art performance on end-to-end document recognition tasks.
arXiv Detail & Related papers (2021-06-10T17:06:28Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.