Iterative pseudo-forced alignment by acoustic CTC loss for
self-supervised ASR domain adaptation
- URL: http://arxiv.org/abs/2210.15226v1
- Date: Thu, 27 Oct 2022 07:23:08 GMT
- Title: Iterative pseudo-forced alignment by acoustic CTC loss for
self-supervised ASR domain adaptation
- Authors: Fernando López and Jordi Luque
- Abstract summary: High-quality data labeling in specific domains is costly and time-consuming for humans.
We propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm.
- Score: 80.12316877964558
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality data labeling in specific domains is costly and
time-consuming for humans. In this work, we propose a self-supervised domain
adaptation method based upon an iterative pseudo-forced alignment algorithm.
The produced alignments are employed to customize an end-to-end Automatic
Speech Recognition (ASR) system and are iteratively refined. The algorithm is
fed with frame-wise character posteriors produced by a seed ASR, trained on
out-of-domain data and optimized through a Connectionist Temporal
Classification (CTC) loss. The alignments are computed iteratively over a
corpus of broadcast TV. The process is repeated, reducing the quantity of text
to be aligned or expanding the alignment window, until the best possible
audio-text alignment is found. The starting timestamps, or temporal anchors,
are derived solely from the confidence score of the last aligned utterance.
This score is computed from the paths of the CTC alignment matrix. With this
methodology, no human-revised text references are required. Alignments from
long audio files with low-quality transcriptions, like TV captions, are
filtered by confidence score and are then ready for further ASR adaptation.
The results obtained on both the Spanish RTVE2022 and CommonVoice databases
underpin the feasibility of using CTC-based systems to perform highly accurate
audio-text alignment, domain adaptation, and semi-supervised training of
end-to-end ASR.
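The core loop of the method lends itself to a compact sketch. The Python below
is a minimal illustration, assuming a (T, V) matrix of frame-wise log character
posteriors from the seed CTC model; the function names, confidence threshold,
and window sizes are illustrative assumptions, not the authors' released
implementation.

    import numpy as np

    NEG_INF = -np.inf

    def ctc_align_score(log_probs, labels, blank=0):
        # Viterbi (best-path) forced-alignment score of `labels` over the
        # (T, V) frame-wise log posteriors, using the standard CTC topology
        # with interleaved blanks. Length-normalized so one confidence
        # threshold works across utterances of different durations.
        T = log_probs.shape[0]
        ext = [blank]
        for l in labels:
            ext += [l, blank]
        S = len(ext)
        alpha = np.full((T, S), NEG_INF)
        alpha[0, 0] = log_probs[0, ext[0]]
        if S > 1:
            alpha[0, 1] = log_probs[0, ext[1]]
        for t in range(1, T):
            for s in range(S):
                best = alpha[t - 1, s]
                if s > 0:
                    best = max(best, alpha[t - 1, s - 1])
                # skip over a blank only between two different labels
                if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                    best = max(best, alpha[t - 1, s - 2])
                alpha[t, s] = best + log_probs[t, ext[s]]
        final = alpha[T - 1, S - 1]
        if S > 1:
            final = max(final, alpha[T - 1, S - 2])
        return final / T

    def iterative_pseudo_align(log_probs, captions, fps=100,
                               init_window_s=20.0, grow_s=5.0,
                               threshold=-1.5):
        # Walk through a long recording. Each utterance is anchored at the
        # end of the previously accepted alignment; on a poor score the
        # audio window is expanded first, then trailing caption tokens are
        # dropped. Utterances that never reach the threshold are filtered
        # out, mirroring the confidence-based filtering in the abstract.
        anchor, aligned = 0, []
        for text, labels in captions:      # (caption string, label ids)
            window = int(init_window_s * fps)
            trial = list(labels)
            while trial:
                end = min(anchor + window, len(log_probs))
                if end <= anchor:
                    break                  # ran out of audio
                score = ctc_align_score(log_probs[anchor:end], trial)
                if score >= threshold:
                    # A real implementation would backtrack the Viterbi
                    # path to place the anchor at the last emitted char.
                    aligned.append((anchor, end, text, score))
                    anchor = end
                    break
                if end < len(log_probs):
                    window += int(grow_s * fps)  # expand the window
                else:
                    trial = trial[:-1]           # reduce the text to align
        return aligned

Normalizing the best-path score by the number of frames keeps the confidence
comparable across utterances of different lengths, which is what allows a
single cutoff to filter alignments from long TV-caption files.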
Related papers
- CTC-aligned Audio-Text Embedding for Streaming Open-vocabulary Keyword Spotting [6.856101216726412]
This paper introduces a novel approach for streaming open-vocabulary keyword spotting (KWS) with text-based keyword enrollment.
For every input frame, the proposed method finds the optimal alignment ending at that frame using connectionist temporal classification (CTC).
We then aggregate the frame-level acoustic embeddings (AE) to obtain a higher-level (i.e., character, word, or phrase) AE that aligns with the text embedding (TE) of the target keyword.
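As a rough illustration of the aggregation step, the sketch below assumes the
frame-level CTC alignment of the keyword has already been computed (e.g., by a
Viterbi pass like the one sketched above); the names and shapes are
assumptions, not the paper's published interface.

    import numpy as np

    def aggregate_keyword_ae(frame_emb, frame_to_char):
        # frame_emb: (T, D) acoustic embeddings; frame_to_char[t] is the
        # character index the CTC alignment assigns to frame t, or -1 for
        # blank frames. Average frames per character, then mean-pool the
        # character AEs into one keyword-level AE.
        chars = sorted({c for c in frame_to_char if c >= 0})
        if not chars:
            raise ValueError("alignment assigned no frames to the keyword")
        char_aes = [
            frame_emb[[t for t, c in enumerate(frame_to_char) if c == ch]]
            .mean(axis=0)
            for ch in chars
        ]
        return np.stack(char_aes).mean(axis=0)

    def keyword_score(ae, te):
        # Cosine similarity between the pooled acoustic embedding and the
        # text embedding of the enrolled keyword.
        return float(ae @ te /
                     (np.linalg.norm(ae) * np.linalg.norm(te) + 1e-8))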
arXiv Detail & Related papers (2024-06-12T06:44:40Z)
- Sequence Transduction with Graph-based Supervision [96.04967815520193]
We present a new transducer objective function that generalizes the RNN-T loss to accept a graph representation of the labels.
We demonstrate that transducer-based ASR with a CTC-like lattice achieves better results than standard RNN-T.
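For intuition about "a graph representation of the labels", here is a minimal
sketch that builds a CTC-like label lattice as an explicit graph in plain
Python; the state/arc encoding is an assumption for illustration, not the
paper's (or any toolkit's) actual format.

    def ctc_like_lattice(labels, blank=0):
        # Return (num_states, arcs), each arc being (src, dst, symbol).
        # States follow the blank-extended label sequence, with self-loops
        # for repeated symbols and skip arcs over blanks between two
        # different labels, as in the CTC topology.
        ext = [blank]
        for l in labels:
            ext += [l, blank]
        arcs = []
        for s, sym in enumerate(ext):
            arcs.append((s, s, sym))                 # self-loop: repeat
            if s + 1 < len(ext):
                arcs.append((s, s + 1, ext[s + 1]))  # advance one position
            if (s + 2 < len(ext) and ext[s + 2] != blank
                    and ext[s + 2] != sym):
                arcs.append((s, s + 2, ext[s + 2]))  # skip the blank
        return len(ext), arcs

A transducer loss generalized to such graphs can then score any label topology
expressed this way, CTC-like or otherwise.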
arXiv Detail & Related papers (2021-11-01T21:51:42Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed timestamping method achieves an average word-timing difference of less than 50 ms.
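A hedged sketch of how per-word confidences might be aggregated from
token-level decoding scores follows. The paper combines several feature types;
the single geometric-mean feature and the interface here are illustrative
stand-ins, not its exact recipe.

    import numpy as np

    def word_confidences(token_logprobs, word_boundaries):
        # token_logprobs: per-token log-probabilities from decoding;
        # word_boundaries: (start, end) token index spans, one per word.
        confs = []
        for start, end in word_boundaries:
            span = token_logprobs[start:end]
            # Geometric-mean token probability is one simple word-level
            # feature; a combiner over several features would replace this.
            confs.append(float(np.exp(np.mean(span))))
        return confs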
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the Corpus of Spontaneous Japanese (CSJ) demonstrate that the proposed method provides accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- Cross-domain Speech Recognition with Unsupervised Character-level Distribution Matching [60.8427677151492]
We propose CMatch, a Character-level distribution matching method to perform fine-grained adaptation between each character in two domains.
Experiments on the Libri-Adapt dataset show that our proposed approach achieves 14.39% and 16.50% relative Word Error Rate (WER) reduction on cross-device and cross-environment ASR, respectively.
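For intuition, a minimal sketch of character-level distribution matching with a
maximum mean discrepancy (MMD) objective, assuming frame features have already
been grouped by character (e.g., via a CTC alignment); the RBF kernel and the
interface are assumptions for illustration, not necessarily CMatch's exact
formulation.

    import numpy as np

    def mmd(x, y, gamma=1.0):
        # Squared maximum mean discrepancy with an RBF kernel between two
        # feature sets x (n, D) and y (m, D).
        def k(a, b):
            d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d)
        return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

    def char_match_loss(src_feats, tgt_feats):
        # src_feats / tgt_feats: dict mapping character id -> (n_i, D)
        # frame features. Sum MMD over characters seen in both domains,
        # giving the fine-grained, per-character adaptation signal.
        shared = src_feats.keys() & tgt_feats.keys()
        return sum(mmd(src_feats[c], tgt_feats[c]) for c in shared)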
arXiv Detail & Related papers (2021-04-15T14:36:54Z)
- Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that, with far less data than is needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
- Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.