Subtitles to Segmentation: Improving Low-Resource Speech-to-Text
Translation Pipelines
- URL: http://arxiv.org/abs/2010.09693v1
- Date: Mon, 19 Oct 2020 17:32:40 GMT
- Title: Subtitles to Segmentation: Improving Low-Resource Speech-to-Text
Translation Pipelines
- Authors: David Wan, Zhengping Jiang, Chris Kedzie, Elsbeth Turcan, Peter Bell
and Kathleen McKeown
- Abstract summary: We focus on improving ASR output segmentation in the context of low-resource language speech-to-text translation.
We use datasets of subtitles from TV shows and movies to train better ASR segmentation models.
We show that this noisy syntactic information can improve model accuracy.
- Score: 15.669334598926342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we focus on improving ASR output segmentation in the context of
low-resource language speech-to-text translation. ASR output segmentation is
crucial, as ASR systems segment the input audio using purely acoustic
information and are not guaranteed to output sentence-like segments. Since most
MT systems expect sentences as input, feeding in longer unsegmented passages
can lead to sub-optimal performance. We explore the feasibility of using
datasets of subtitles from TV shows and movies to train better ASR segmentation
models. We further incorporate part-of-speech (POS) tag and dependency label
information (derived from the unsegmented ASR outputs) into our segmentation
model. We show that this noisy syntactic information can improve model
accuracy. We evaluate our models intrinsically on segmentation quality and
extrinsically on downstream MT performance, as well as downstream tasks
including cross-lingual information retrieval (CLIR) tasks and human relevance
assessments. Our model shows improved performance on downstream tasks for
Lithuanian and Bulgarian.
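As a rough illustration of the featurization step the abstract describes, here is a minimal sketch, not the authors' released code: POS tags and dependency labels are derived from the unsegmented ASR output and attached to each token, so a downstream tagger can predict subtitle-style sentence boundaries. spaCy and English are assumptions for illustration; the paper targets low-resource languages where such tags are noisier.

```python
# Minimal illustrative sketch, not the authors' code: derive noisy POS
# and dependency features from an *unsegmented* ASR passage so a
# token-level tagger can predict sentence-like boundaries. Assumes
# spaCy and its small English model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def featurize(asr_passage: str):
    """One feature dict per token; boundary labels would come from subtitle breaks."""
    doc = nlp(asr_passage)
    return [{"word": tok.lower_,
             "pos": tok.pos_,    # noisy: parsed without sentence boundaries
             "dep": tok.dep_}    # noisy dependency label
            for tok in doc]

print(featurize("we went home it was raining hard she stayed"))
```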
Related papers
- Lightweight Audio Segmentation for Long-form Speech Translation [17.743473111298826]
We propose a segmentation model that achieves better speech translation quality with a small model size.
We also show that proper integration of the speech segmentation model into the underlying ST system is critical to improve overall translation quality at inference time.
arXiv Detail & Related papers (2024-06-15T08:02:15Z)
- REBORN: Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR [54.64158282822995]
We propose REBORN, Reinforcement-Learned Boundary Segmentation with Iterative Training for Unsupervised ASR.
REBORN alternates between training a segmentation model, which predicts the boundaries of segmental structures in speech signals, and training a phoneme prediction model, which takes the speech features segmented by the segmentation model as input and predicts a phoneme transcription.
We conduct extensive experiments and find that under the same setting, REBORN outperforms all prior unsupervised ASR models on LibriSpeech, TIMIT, and five non-English languages in Multilingual LibriSpeech.
arXiv Detail & Related papers (2024-02-06T13:26:19Z)
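A skeleton of the alternation this summary describes, with hypothetical placeholder modules; REBORN's actual segmenter is trained with reinforcement learning, and both training stages are elided here.

```python
# Illustrative skeleton only, not the REBORN release.
import torch
import torch.nn as nn

class Segmenter(nn.Module):
    """Predicts a boundary probability for every speech frame."""
    def __init__(self, dim: int = 39):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (T, dim) -> (T,)
        return torch.sigmoid(self.score(feats)).squeeze(-1)

def cut(feats: torch.Tensor, probs: torch.Tensor, thresh: float = 0.5):
    """Pool frames into segment vectors at predicted boundaries."""
    bounds = (probs > thresh).nonzero().flatten().tolist()
    starts = [0] + [b + 1 for b in bounds]
    ends = bounds + [feats.size(0)]
    return [feats[s:e].mean(0) for s, e in zip(starts, ends) if s < e]

feats = torch.randn(200, 39)        # stand-in for one utterance's features
segmenter = Segmenter()
for iteration in range(3):          # REBORN-style alternation
    segments = cut(feats, segmenter(feats).detach())
    # Stage 1 (elided): train a phoneme predictor on `segments`.
    # Stage 2 (elided): update `segmenter` with the phoneme model's
    # score as a reinforcement-learning reward, then repeat.
```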
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that takes longer subtitle text into account, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
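A sketch of the windowed-prompt idea this entry summarizes: a span of timed subtitle lines is packed into a single prompt so the model sees context beyond one sentence. The prompt wording is invented here, and the LLM call itself is elided; the paper's actual prompt and pipeline differ.

```python
# Hedged sketch of prompting over a subtitle window, not HowToCaption's code.
from typing import List, Tuple

Subtitle = Tuple[float, str]  # (start time in seconds, text)

def build_prompt(window: List[Subtitle]) -> str:
    lines = "\n".join(f"[{start:7.1f}s] {text}" for start, text in window)
    return ("Rewrite the following ASR subtitles into concise, "
            "self-contained video captions, one per timestamp:\n" + lines)

subs = [(12.0, "so first you want to dice the onions"),
        (15.5, "and then get them into the pan"),
        (19.0, "medium heat is fine here")]
print(build_prompt(subs))
# The LLM's answers would then be re-aligned to the video by timestamp.
```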
- Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
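A small sketch of the serialization this entry describes, with the word alignment assumed as given (the paper derives it from textual alignments): the single decoder's target interleaves ASR tokens with the translation tokens aligned to them, so both outputs stream from one sequence.

```python
# Simplified illustration of token-level serialized output training targets.
from typing import Dict, List

def serialize(asr: List[str], st: List[str],
              align: Dict[int, List[int]]) -> List[str]:
    """Interleave ASR tokens with the ST tokens aligned to them."""
    target = []
    for i, word in enumerate(asr):
        target.append(word)                  # ASR token
        for j in align.get(i, []):
            target.append(f"<st>{st[j]}")    # aligned ST token(s)
    return target

asr = ["guten", "morgen"]
st = ["good", "morning"]
print(serialize(asr, st, {0: [0], 1: [1]}))
# ['guten', '<st>good', 'morgen', '<st>morning']
```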
- End-to-End Simultaneous Speech Translation with Differentiable Segmentation [21.03142288187605]
Simultaneous speech translation (SimulST) outputs translation while receiving streaming speech inputs.
Segmenting the speech inputs at unfavorable moments can disrupt the acoustic integrity and adversely affect the performance of the translation model.
We propose Differentiable Segmentation (DiSeg) for SimulST to learn segmentation directly from the underlying translation model.
arXiv Detail & Related papers (2023-05-25T14:25:12Z)
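The entry above hinges on making hard segmentation decisions trainable. The sketch below shows one generic way to do that, a straight-through estimator over boundary scores; it illustrates differentiable segmentation in general, not DiSeg's specific formulation.

```python
# Generic straight-through trick: hard 0/1 boundaries on the forward
# pass, soft sigmoid gradients on the backward pass, so a translation
# loss can train the segmentation scores.
import torch

def straight_through_boundaries(scores: torch.Tensor) -> torch.Tensor:
    probs = torch.sigmoid(scores)
    hard = (probs > 0.5).float()
    return hard + probs - probs.detach()   # forward: hard; gradient: probs

scores = torch.randn(10, requires_grad=True)
boundaries = straight_through_boundaries(scores)
loss = boundaries.sum()                    # stand-in for a translation loss
loss.backward()
print(scores.grad)                         # non-zero: segmentation is trainable
```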
- Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
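A hedged sketch of the data preparation the entry above suggests: build each example from a segment plus its neighbors, and record which span the loss should cover. The function and feature shapes are assumptions for illustration, not the paper's recipe.

```python
# Sketch: concatenate surrounding segments as context for fine-tuning.
import torch

def with_context(segments, i):
    """Concatenate neighbors; return features plus the center's span."""
    dim = segments[i].size(1)
    prev = segments[i - 1] if i > 0 else torch.empty(0, dim)
    nxt = segments[i + 1] if i + 1 < len(segments) else torch.empty(0, dim)
    feats = torch.cat([prev, segments[i], nxt], dim=0)
    return feats, (prev.size(0), prev.size(0) + segments[i].size(0))

segs = [torch.randn(t, 80) for t in (40, 55, 30)]   # three log-mel segments
feats, (lo, hi) = with_context(segs, 1)
print(feats.shape, (lo, hi))   # torch.Size([125, 80]) (40, 95)
```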
- Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora [15.084508754409848]
Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles.
We propose a method to convert existing ST corpora into SubST resources without human intervention.
We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion.
arXiv Detail & Related papers (2022-09-21T19:06:36Z)
- Segmenting Subtitles for Correcting ASR Segmentation Errors [11.854481771567503]
We propose a model for correcting the acoustic segmentation of ASR models for low-resource languages.
We train a neural tagging model for correcting ASR acoustic segmentation and show that it improves downstream performance.
arXiv Detail & Related papers (2021-04-16T03:04:10Z)
- Contextualized Translation of Automatically Segmented Speech [20.334746967390164]
We train our models on randomly segmented data and compare two approaches: fine-tuning and adding the previous segment as context.
Our solution is more robust to VAD-segmented input, outperforming both a strong base model and the fine-tuning approach on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
arXiv Detail & Related papers (2020-08-05T17:52:25Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)
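To make the APR task concrete, here is a toy, rule-based stand-in that only shows the kind of input/output pair involved; the paper itself fine-tunes pre-trained seq2seq models rather than using rules.

```python
# Toy illustration of the APR task format, not the paper's method.
import re

DISFLUENCIES = re.compile(r"\b(um+|uh+|you know)\b[, ]*", re.IGNORECASE)

def naive_apr(asr_output: str) -> str:
    """Drop fillers, capitalize, and add terminal punctuation."""
    text = DISFLUENCIES.sub("", asr_output).strip()
    text = re.sub(r"\s{2,}", " ", text)
    return text[:1].upper() + text[1:] + "."

print(naive_apr("um so we uh met the client at three pm you know"))
# -> "So we met the client at three pm."
```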
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.