Dodging the Data Bottleneck: Automatic Subtitling with Automatically
Segmented ST Corpora
- URL: http://arxiv.org/abs/2209.10608v1
- Date: Wed, 21 Sep 2022 19:06:36 GMT
- Title: Dodging the Data Bottleneck: Automatic Subtitling with Automatically
Segmented ST Corpora
- Authors: Sara Papi, Alina Karakanta, Matteo Negri, Marco Turchi
- Abstract summary: Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles.
We propose a method to convert existing ST corpora into SubST resources without human intervention.
We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion.
- Score: 15.084508754409848
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech translation for subtitling (SubST) is the task of automatically
translating speech data into well-formed subtitles by inserting subtitle breaks
that comply with specific display guidelines. As in speech translation (ST),
model training requires parallel data comprising audio inputs paired with their
textual translations. In SubST, however, the text also has to be annotated with
subtitle breaks. So far, this requirement has represented a bottleneck for
system development, as confirmed by the dearth of publicly available SubST
corpora. To fill this gap, we propose a method to convert existing ST corpora
into SubST resources without human intervention. We build a segmenter model
that automatically segments texts into proper subtitles by exploiting audio and
text in a multimodal fashion, achieving high segmentation quality in zero-shot
conditions. Comparative experiments with SubST systems trained on manual and
automatic segmentations, respectively, show similar performance, demonstrating
the effectiveness of our approach.
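The abstract describes the segmenter only at a high level. As a rough, hypothetical illustration of what a multimodal text+audio segmenter could look like, the sketch below tags each target-language token with a no-break, line-break (<eol>) or subtitle-block-break (<eob>) label, fusing text and audio with cross-attention; the <eol>/<eob> markup follows MuST-Cinema-style conventions, while the architecture, dimensions, and names are assumptions, not the authors' actual model.

```python
# Hypothetical sketch of a multimodal subtitle segmenter: it assigns each target-language
# token a break label (0 = no break, 1 = <eol>, 2 = <eob>) using both the token sequence
# and audio features. Dimensions, layer choices, and names are illustrative only.
import torch
import torch.nn as nn

class MultimodalSegmenter(nn.Module):
    def __init__(self, vocab_size=8000, d_model=256, audio_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)   # project filterbank frames
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(d_model, 3)           # no-break / <eol> / <eob>

    def forward(self, tokens, audio):
        text = self.text_enc(self.embed(tokens))          # (B, T_text, d)
        audio = self.audio_proj(audio)                    # (B, T_audio, d)
        fused, _ = self.cross_attn(text, audio, audio)    # text attends to audio
        return self.classifier(text + fused)              # (B, T_text, 3) break logits

# Toy forward pass: one sentence of 12 subword tokens, 200 frames of 80-dim audio features.
model = MultimodalSegmenter()
logits = model(torch.randint(0, 8000, (1, 12)), torch.randn(1, 200, 80))
print(logits.argmax(-1).shape)  # torch.Size([1, 12]): predicted break label per token
```

Applying such a segmenter to the target side of an existing ST corpus would yield the break-annotated text that SubST training requires, which is the conversion step the abstract refers to.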
Related papers
- Token-Level Serialized Output Training for Joint Streaming ASR and ST
Leveraging Textual Alignments [49.38965743465124]
This paper introduces a streaming Transformer-Transducer that jointly generates automatic speech recognition (ASR) and speech translation (ST) outputs using a single decoder.
Experiments in monolingual and multilingual settings demonstrate that our approach achieves the best quality-latency balance.
arXiv Detail & Related papers (2023-07-07T02:26:18Z)
- Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection to improve the performance of a widely used industrial streaming model, the Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, either explicitly via Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces.
Experimental results on a T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z)
- Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation [71.35243644890537]
End-to-end Speech Translation (ST) aims at translating the source language speech into target language text without generating the intermediate transcriptions.
Existing zero-shot methods fail to align the two modalities of speech and text into a shared semantic space.
We propose a novel Discrete Cross-Modal Alignment (DCMA) method that employs a shared discrete vocabulary space to accommodate and match both modalities of speech and text.
arXiv Detail & Related papers (2022-10-18T03:06:47Z)
- Direct Speech Translation for Automatic Subtitling [17.095483965591267]
We propose the first direct ST model for automatic subtitling, which generates subtitles in the target language along with their timestamps using a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition.
arXiv Detail & Related papers (2022-09-27T06:47:42Z)
- Automatic Prosody Annotation with Pre-Trained Text-Speech Model [48.47706377700962]
We propose to automatically extract prosodic boundary labels from text-audio data via a neural text-speech model with pre-trained audio encoders.
This model is pre-trained on text and speech data separately and jointly fine-tuned on TTS data in a triplet format: speech, text, prosody.
arXiv Detail & Related papers (2022-06-16T06:54:16Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Between Flexibility and Consistency: Joint Generation of Captions and Subtitles [13.58711830450618]
Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing.
In this work, we focus on ST models that generate captions and subtitles which are consistent in terms of structure and lexical content.
Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.
arXiv Detail & Related papers (2021-07-13T17:06:04Z)
- Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not.
arXiv Detail & Related papers (2021-05-06T17:59:36Z)
- Subtitles to Segmentation: Improving Low-Resource Speech-to-Text Translation Pipelines [15.669334598926342]
We focus on improving ASR output segmentation in the context of low-resource language speech-to-text translation.
We use datasets of subtitles from TV shows and movies to train better ASR segmentation models.
We show that this noisy syntactic information can improve model accuracy.
arXiv Detail & Related papers (2020-10-19T17:32:40Z)
- MuST-Cinema: a Speech-to-Subtitles corpus [16.070428245677675]
We present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
We show that the corpus can be used to build models that efficiently segment sentences into subtitles.
We propose a method for annotating existing subtitling corpora with subtitle breaks that conform to length constraints (a minimal length-based sketch follows this list).
arXiv Detail & Related papers (2020-02-25T12:40:06Z)
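As a toy counterpart to the MuST-Cinema-style break annotation mentioned in the last entry above, the following sketch inserts <eol>/<eob> markers using only a character-per-line limit; the constants and the purely length-based rule are assumptions for illustration, whereas real subtitling guidelines also involve reading speed and syntactic criteria.

```python
# Minimal, purely length-based sketch of subtitle-break annotation (illustrative only):
# insert <eol> when a line would exceed MAX_CHARS characters and <eob> once a subtitle
# block reaches LINES_PER_SUBTITLE lines. Real annotation pipelines apply further rules.
MAX_CHARS = 42          # a common character-per-line limit in subtitling guidelines
LINES_PER_SUBTITLE = 2  # at most two lines per subtitle block

def annotate_breaks(sentence: str) -> str:
    out, line_len, lines_in_block = [], 0, 0
    for word in sentence.split():
        if line_len and line_len + 1 + len(word) > MAX_CHARS:
            lines_in_block += 1
            if lines_in_block == LINES_PER_SUBTITLE:
                out.append("<eob>")      # close the subtitle block
                lines_in_block = 0
            else:
                out.append("<eol>")      # start a new line within the block
            line_len = 0
        out.append(word)
        line_len += len(word) + (1 if line_len else 0)
    return " ".join(out)

print(annotate_breaks(
    "Speech translation for subtitling is the task of automatically translating "
    "speech data into well-formed subtitles by inserting subtitle breaks."))
```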