Segmenting Subtitles for Correcting ASR Segmentation Errors
- URL: http://arxiv.org/abs/2104.07868v1
- Date: Fri, 16 Apr 2021 03:04:10 GMT
- Title: Segmenting Subtitles for Correcting ASR Segmentation Errors
- Authors: David Wan, Chris Kedzie, Faisal Ladhak, Elsbeth Turcan, Petra
Galuščáková, Elena Zotkina, Zhengping Jiang, Peter Bell, Kathleen
McKeown
- Abstract summary: We propose a model for correcting the acoustic segmentation of ASR models for low-resource languages.
We train a neural tagging model for correcting ASR acoustic segmentation and show that it improves downstream performance.
- Score: 11.854481771567503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Typical ASR systems segment the input audio into utterances using purely
acoustic information, which may not resemble the sentence-like units that are
expected by conventional machine translation (MT) systems for Spoken Language
Translation. In this work, we propose a model for correcting the acoustic
segmentation of ASR models for low-resource languages to improve performance on
downstream tasks. We propose the use of subtitles as a proxy dataset for
correcting ASR acoustic segmentation, creating synthetic acoustic utterances by
modeling common error modes. We train a neural tagging model for correcting ASR
acoustic segmentation and show that it improves downstream performance on MT
and audio-document cross-language information retrieval (CLIR).
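As a concrete illustration of the proxy-data idea, the sketch below turns gold subtitle sentences into synthetic "acoustic" utterances with token-level boundary tags for training a tagger. The specific merge/split error modes and function names are illustrative assumptions, not the paper's exact procedure.

```python
import random

def make_synthetic_utterances(sentences, p_merge=0.3, p_split=0.2, seed=0):
    """Turn gold subtitle sentences into synthetic 'acoustic' utterances.

    Returns a list of (tokens, tags) pairs, where tag "E" marks a token that
    ends a true sentence. The merge/split error modes are illustrative
    assumptions about what 'common error modes' of acoustic segmentation
    look like.
    """
    rng = random.Random(seed)
    tokens, tags = [], []
    for sent in sentences:  # flatten, remembering true sentence-final tokens
        words = sent.split()
        tokens.extend(words)
        tags.extend(["O"] * (len(words) - 1) + ["E"])

    utterances, start = [], 0
    for i in range(len(tokens)):
        if tags[i] == "E" and rng.random() > p_merge:
            # keep the true boundary (with prob. p_merge we run through it,
            # merging two sentences into one utterance)
            utterances.append((tokens[start:i + 1], tags[start:i + 1]))
            start = i + 1
        elif tags[i] == "O" and rng.random() < p_split:
            # spurious boundary inside a sentence (split error)
            utterances.append((tokens[start:i + 1], tags[start:i + 1]))
            start = i + 1
    if start < len(tokens):
        utterances.append((tokens[start:], tags[start:]))
    return utterances

# A tagger trained on these (tokens -> tags) pairs can restore sentence-like
# units from raw ASR utterances at test time.
for toks, tgs in make_synthetic_utterances(
        ["this is one sentence", "and here is another"]):
    print(list(zip(toks, tgs)))
```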
Related papers
- Speculative Speech Recognition by Audio-Prefixed Low-Rank Adaptation of Language Models [21.85677682584916]
This paper explores speculative speech recognition (SSR), in which transcription runs ahead of the audio observed so far; a toy sketch follows below.
We propose a model which does SSR by combining an RNN-Transducer-based ASR system with an audio-prefixed language model (LM).
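A toy sketch of the speculation step, with a stub standing in for the audio-prefixed LM (this interface is hypothetical, not the paper's API):

```python
def speculate(partial_hypothesis, lm_generate, max_new_tokens=3):
    """Speculative recognition in miniature: while audio is still streaming,
    a language model guesses the words likely to follow the ASR system's
    partial hypothesis. `lm_generate` is a hypothetical stand-in for the
    paper's audio-prefixed LM."""
    return partial_hypothesis + lm_generate(partial_hypothesis, max_new_tokens)

# Toy LM stub that always guesses a fixed continuation.
print(speculate(["thanks", "for"], lambda ctx, n: ["calling", "today"][:n]))
```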
arXiv Detail & Related papers (2024-07-05T16:52:55Z)
- Lightweight Audio Segmentation for Long-form Speech Translation [17.743473111298826]
We propose a segmentation model that achieves better speech translation quality with a small model size.
We also show that proper integration of the speech segmentation model into the underlying ST system is critical to improving overall translation quality at inference time; a minimal sketch of such a model follows below.
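A minimal sketch of what a small frame-level segmenter might look like; the BiLSTM architecture and dimensions are assumptions, not the paper's model:

```python
import torch
import torch.nn as nn

class FrameSegmenter(nn.Module):
    """Small frame-level segmenter: each audio frame gets a probability of
    being a segment boundary. Architecture and sizes are assumptions."""
    def __init__(self, feat_dim=80, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, frames):  # frames: (batch, time, feat_dim)
        h, _ = self.encoder(frames)
        return torch.sigmoid(self.head(h)).squeeze(-1)  # (batch, time)

probs = FrameSegmenter()(torch.randn(1, 200, 80))
cut_points = (probs[0] > 0.5).nonzero()  # candidate cuts handed to the ST system
```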
arXiv Detail & Related papers (2024-06-15T08:02:15Z)
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between training a segmentation model, which predicts the boundaries of the segmental structures in speech signals, and training a phoneme prediction model, which takes the speech features segmented by the segmentation model as input and predicts a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
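The alternation can be sketched as a two-step loop; the helpers below are trivial stand-ins with hypothetical names, and the reinforcement-learning details of the boundary step are elided:

```python
# Trivial stand-ins for the two models; the real boundary model is trained
# with reinforcement learning (elided here).
def init_segmenter():      return lambda feats: [len(feats) // 2]  # dummy boundary
def init_phoneme_model():  return lambda segs: ["AH"] * len(segs)  # dummy phonemes

def train_segmenter_rl(segmenter, phoneme_model, data):
    # RL step: reward boundaries whose phoneme transcriptions score well.
    return segmenter

def train_phoneme_model(phoneme_model, segments):
    # Retrain the phoneme predictor on features pooled within each segment.
    return phoneme_model

def reborn(data, n_iters=3):
    segmenter, phoneme_model = init_segmenter(), init_phoneme_model()
    for _ in range(n_iters):  # the two training steps alternate
        segmenter = train_segmenter_rl(segmenter, phoneme_model, data)
        segments = [segmenter(x) for x in data]
        phoneme_model = train_phoneme_model(phoneme_model, segments)
    return segmenter, phoneme_model

reborn([[0.1] * 10, [0.2] * 8])
```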
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
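A minimal sketch of prompt tuning with a frozen backbone; the tiny Transformer below stands in for a real pre-trained VSR model, and all sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PromptTunedModel(nn.Module):
    """Prompt tuning in miniature: freeze a pre-trained encoder and learn only
    a few prompt vectors prepended to the input sequence."""
    def __init__(self, backbone, dim=64, n_prompts=8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # pre-trained weights stay fixed
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, x):  # x: (batch, time, dim) visual features
        prompts = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, x], dim=1))

layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = PromptTunedModel(nn.TransformerEncoder(layer, num_layers=2))
optimizer = torch.optim.Adam([model.prompts], lr=1e-3)  # adapt prompts only
out = model(torch.randn(2, 30, 64))
```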
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
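The recipe can be sketched as a small pipeline; the TTS and VC helpers below are hypothetical stubs, not real systems:

```python
# Hypothetical stubs stand in for real cross-lingual TTS and VC systems.
def tts_synthesize(text, speaker):  return f"<audio:{speaker}:{text}>"
def voice_convert(audio, speaker):  return audio.replace(audio.split(":")[1], speaker)

def augment(texts, real_speaker, pseudo_speakers):
    """Build (audio, transcript) pairs from text alone: synthesize with the
    single real target-language speaker, then diversify voices with VC."""
    data = []
    for text in texts:
        synth = tts_synthesize(text, real_speaker)
        data.append((synth, text))
        for spk in pseudo_speakers:  # cross-lingual VC adds speaker variety
            data.append((voice_convert(synth, spk), text))
    return data

print(augment(["bom dia"], real_speaker="spk1", pseudo_speakers=["en_a", "en_b"]))
```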
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
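A sketch of the fusion idea: attend over the encodings of all n-best hypotheses so the summarizer need not trust the 1-best transcript. Module shapes are assumptions; the paper fuses hypotheses inside a full speech-summarization model.

```python
import torch
import torch.nn as nn

class HypothesisFusion(nn.Module):
    """Attention over the encodings of several ASR hypotheses, letting the
    summarizer weight reliable spans across the n-best list."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=heads, batch_first=True)

    def forward(self, query, hyp_encodings):
        # hyp_encodings: (batch, n_hyps * tokens, dim) -- the n-best lists
        # concatenated, so attention can favor spans that hypotheses agree on.
        fused, _ = self.attn(query, hyp_encodings, hyp_encodings)
        return fused

fusion = HypothesisFusion()
decoder_states = torch.randn(1, 10, 64)       # summary-side queries
nbest_encodings = torch.randn(1, 5 * 20, 64)  # 5 hypotheses x 20 tokens
out = fusion(decoder_states, nbest_encodings)
```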
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
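A minimal sketch of adversarial reprogramming for SCR: freeze the pre-trained model, learn an additive input perturbation, and map source labels to target labels. The similarity-based mapping is simplified to a fixed matrix here, and shapes are assumptions.

```python
import torch
import torch.nn as nn

class Reprogrammer(nn.Module):
    """Adversarial reprogramming in miniature: only the input perturbation
    is trained; the pre-trained SCR model stays frozen."""
    def __init__(self, pretrained, input_len=16000, n_source=35, n_target=10):
        super().__init__()
        self.pretrained = pretrained
        for p in self.pretrained.parameters():  # only the perturbation trains
            p.requires_grad = False
        self.delta = nn.Parameter(torch.zeros(input_len))
        self.register_buffer("label_map", torch.eye(n_target, n_source))

    def forward(self, wave):  # wave: (batch, input_len)
        source_logits = self.pretrained(wave + self.delta)
        return source_logits @ self.label_map.T  # (batch, n_target)

pretrained = nn.Sequential(nn.Linear(16000, 35))  # stand-in for a real SCR model
logits = Reprogrammer(pretrained)(torch.randn(2, 16000))  # (2, 10)
```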
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
- Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines [15.669334598926342]
We focus on improving ASR output segmentation in the context of low-resource language speech-to-text translation.
We use datasets of subtitles from TV shows and movies to train better ASR segmentation models.
We show that this noisy syntactic information can improve model accuracy.
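A sketch of a boundary tagger that feeds the noisy syntactic signal in as POS-tag embeddings alongside word embeddings; sizes and the tag inventory are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class BoundaryTagger(nn.Module):
    """Sentence-boundary tagger combining word and (noisy) POS-tag
    embeddings, reflecting the finding that syntactic information helps."""
    def __init__(self, vocab=10000, n_pos=20, dim=32, hidden=64):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Embedding(n_pos, dim)
        self.rnn = nn.LSTM(2 * dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 2)  # boundary vs. not-boundary

    def forward(self, words, pos):
        x = torch.cat([self.word_emb(words), self.pos_emb(pos)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)  # (batch, time, 2) per-token logits

tagger = BoundaryTagger()
logits = tagger(torch.randint(0, 10000, (1, 12)), torch.randint(0, 20, (1, 12)))
```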
arXiv Detail & Related papers (2020-10-19T17:32:40Z)
- Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that, with far less data than would be needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
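A toy sketch of the cascade: a verbatim ASR pass followed by a compression post-edit. Both components are stubs; a real system would run an actual ASR model and an unsupervised compression model.

```python
# Dropping filler words is a toy stand-in for unsupervised compression.
def asr_transcribe(audio):
    return "so um I think that we should you know probably start the meeting now"

def compress(transcript, max_words=8):
    fillers = {"so", "um", "uh", "you", "know", "probably"}
    kept = [w for w in transcript.split() if w not in fillers]
    return " ".join(kept[:max_words])  # enforce a subtitle-friendly length

print(compress(asr_transcribe(None)))  # -> "I think that we should start the meeting"
```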
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring a tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reductions on overlapped speech constructed using either simulation or replay of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context-dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z)