Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and
Backward Transformers
- URL: http://arxiv.org/abs/2104.10328v1
- Date: Wed, 21 Apr 2021 03:05:12 GMT
- Title: Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and
Backward Transformers
- Authors: Yusuke Kida, Tatsuya Komatsu, Masahito Togami
- Abstract summary: This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method redefines speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the Corpus of Spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
- Score: 49.403414751667135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel label-synchronous speech-to-text
alignment technique for automatic speech recognition (ASR). Speech-to-text
alignment is the problem of splitting long audio recordings with unaligned
transcripts into utterance-wise pairs of speech and text. Unlike conventional
methods based on frame-synchronous prediction, the proposed method redefines
speech-to-text alignment as a label-synchronous text mapping problem. This
enables an accurate alignment that benefits from the strong inference ability
of state-of-the-art attention-based encoder-decoder models, which cannot be
applied in the conventional methods. Two Transformer models, a forward
Transformer and a backward Transformer, are used to estimate the initial and
final tokens of a given speech segment, respectively, based on end-of-sentence
prediction with teacher forcing. Experiments using the Corpus of Spontaneous
Japanese (CSJ) demonstrate that the proposed method provides an accurate
utterance-wise alignment that matches the manually annotated alignment with as
few as 0.2% errors. It is also confirmed that a Transformer-based hybrid
CTC/attention ASR model using the aligned speech-text pairs as additional
training data reduces character error rates by up to 59.0% relative, which is
significantly better than the 39.0% reduction achieved by a conventional
alignment method based on a connectionist temporal classification (CTC) model.
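The core idea of the abstract can be illustrated in miniature. The sketch below is our own simplified illustration, not the paper's implementation: the forward and backward Transformer decoders are replaced by toy scoring functions that return, for each position in the teacher-forced reference token sequence, a score for the end-of-sentence (EOS) token. The forward scorer locates the final token of a segment; the backward scorer, run over the reversed sequence, locates the initial token. All function and variable names here are hypothetical.

```python
def locate_boundary(eos_scores):
    """Return the index with the highest EOS score (the most likely boundary)."""
    best_idx, best_score = 0, float("-inf")
    for idx, score in enumerate(eos_scores):
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx

def align_segment(tokens, forward_eos_scorer, backward_eos_scorer):
    """Map one speech segment to a (start, end) span in the reference tokens.

    forward_eos_scorer(tokens)  -> per-position EOS scores, read left-to-right
                                   (detects the segment's final token).
    backward_eos_scorer(tokens) -> per-position EOS scores over the reversed
                                   sequence (detects the segment's first token).
    """
    end = locate_boundary(forward_eos_scorer(tokens))
    # The backward model scores the reversed sequence, so its boundary index
    # must be mapped back into the original token ordering.
    rev_idx = locate_boundary(backward_eos_scorer(tokens[::-1]))
    start = len(tokens) - 1 - rev_idx
    return start, end

# Toy stand-ins for the two Transformer decoders: each peaks at one fixed
# position, mimicking a confident boundary prediction under teacher forcing.
def toy_forward(tokens):
    return [0.9 if i == 5 else 0.01 for i in range(len(tokens))]

def toy_backward(tokens):
    return [0.8 if i == len(tokens) - 3 else 0.01 for i in range(len(tokens))]

tokens = list("こんにちは世界です")  # 9 reference tokens
span = align_segment(tokens, toy_forward, toy_backward)  # -> (2, 5)
```

In the actual method the scores would come from attention-based decoders conditioned on the segment's acoustics, and the two boundary estimates together cut the long transcript into utterance-wise spans; this sketch only shows the boundary-picking logic.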
Related papers
- Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices [8.77712061194924]
We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models.
Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations.
We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
arXiv Detail & Related papers (2024-09-24T21:42:25Z) - Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
A common solution to automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals.
In practice, however, the separator produces artefacts which often degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
arXiv Detail & Related papers (2024-06-13T08:20:58Z) - Whispering LLaMA: A Cross-Modal Generative Error Correction Framework
for Speech Recognition [10.62060432965311]
We introduce a new cross-modal fusion technique designed for generative error correction in automatic speech recognition (ASR).
Our methodology leverages both acoustic information and external linguistic representations to generate accurate speech transcription contexts.
arXiv Detail & Related papers (2023-10-10T09:04:33Z) - Towards zero-shot Text-based voice editing using acoustic context
conditioning, utterance embeddings, and reference encoders [14.723225542605105]
Text-based voice editing (TBVE) uses synthetic output from text-to-speech (TTS) systems to replace words in an original recording.
Recent work has used neural models to produce edited speech similar to the original speech in terms of clarity, speaker identity, and prosody.
This work focuses on the zero-shot approach which avoids finetuning altogether.
arXiv Detail & Related papers (2022-10-28T10:31:44Z) - Iterative pseudo-forced alignment by acoustic CTC loss for
self-supervised ASR domain adaptation [80.12316877964558]
High-quality data labeling for specific domains is costly and time-consuming.
We propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm.
arXiv Detail & Related papers (2022-10-27T07:23:08Z) - Speaker Embedding-aware Neural Diarization: a Novel Framework for
Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z) - Using multiple reference audios and style embedding constraints for
speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z) - Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z) - Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.