MuST-Cinema: a Speech-to-Subtitles corpus
- URL: http://arxiv.org/abs/2002.10829v1
- Date: Tue, 25 Feb 2020 12:40:06 GMT
- Title: MuST-Cinema: a Speech-to-Subtitles corpus
- Authors: Alina Karakanta, Matteo Negri, Marco Turchi
- Abstract summary: We present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
We show that the corpus can be used to build models that efficiently segment sentences into subtitles.
We propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to the length constraint.
- Score: 16.070428245677675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Growing needs in localising audiovisual content in multiple languages through
subtitles call for the development of automatic solutions for human subtitling.
Neural Machine Translation (NMT) can contribute to the automatisation of
subtitling, facilitating the work of human subtitlers and reducing turn-around
times and related costs. NMT requires high-quality, large, task-specific
training data. The existing subtitling corpora, however, are missing both
alignments to the source language audio and important information about
subtitle breaks. This poses a significant limitation for developing efficient
automatic approaches for subtitling, since the length and form of a subtitle
directly depend on the duration of the utterance. In this work, we present
MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
The corpus consists of (audio, transcription, translation) triplets.
Subtitle breaks are preserved by inserting special symbols. We show that the
corpus can be used to build models that efficiently segment sentences into
subtitles and propose a method for annotating existing subtitling corpora with
subtitle breaks, conforming to the length constraint.
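As a rough illustration of how such break-annotated text can be handled, the sketch below assumes <eol> and <eob> as the special symbols marking line breaks and subtitle-block breaks, and a 42-character line limit as the length constraint; the abstract does not name the symbols or the limit, so these specifics are assumptions for illustration only. The first function reconstructs subtitle blocks from an annotated sentence; the second is a naive greedy baseline that inserts breaks into unannotated text so that no line exceeds the limit.

```python
# Minimal sketch: break symbols and line limit are assumed, not taken from the paper.
EOL = "<eol>"   # assumed marker: line break inside a subtitle block
EOB = "<eob>"   # assumed marker: end of a subtitle block


def to_subtitles(annotated_sentence):
    """Split a break-annotated sentence into subtitle blocks (each a list of lines)."""
    blocks = []
    for block in annotated_sentence.split(EOB):
        lines = [line.strip() for line in block.split(EOL)]
        lines = [line for line in lines if line]
        if lines:
            blocks.append(lines)
    return blocks


def annotate_by_length(sentence, max_chars=42, lines_per_block=2):
    """Greedy baseline: insert break symbols so that no subtitle line exceeds max_chars."""
    lines, current = [], ""
    for word in sentence.split():
        candidate = (current + " " + word).strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    # Close a block every `lines_per_block` lines and at the end of the sentence.
    pieces = []
    for i, line in enumerate(lines, start=1):
        marker = EOB if (i % lines_per_block == 0 or i == len(lines)) else EOL
        pieces.append(f"{line} {marker}")
    return " ".join(pieces)


print(to_subtitles("People walk slower <eol> when they use their phones. <eob>"))
# [['People walk slower', 'when they use their phones.']]
print(annotate_by_length("People who walk slowly while looking at their phones are a growing hazard on busy sidewalks."))
```

The segmentation models described in the paper presumably learn where to place the break symbols from the corpus; the greedy heuristic above only demonstrates the annotation format and the length constraint.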
Related papers
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that takes longer subtitle passages into account, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, its length/duration should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Direct Speech Translation for Automatic Subtitling [17.095483965591267]
We propose the first direct ST model for automatic subtitling, which generates target-language subtitles together with their timestamps using a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system under the same data conditions.
arXiv Detail & Related papers (2022-09-27T06:47:42Z)
- Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora [15.084508754409848]
Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles.
We propose a method to convert existing ST corpora into SubST resources without human intervention.
We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion.
arXiv Detail & Related papers (2022-09-21T19:06:36Z)
- AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
- Between Flexibility and Consistency: Joint Generation of Captions and Subtitles [13.58711830450618]
Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing.
In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content.
Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.
arXiv Detail & Related papers (2021-07-13T17:06:04Z)
- Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not; a minimal sketch of this formulation follows the list below.
arXiv Detail & Related papers (2021-05-06T17:59:36Z)
- Is 42 the Answer to Everything in Subtitling-oriented Speech Translation? [16.070428245677675]
Subtitling is becoming increasingly important for disseminating information.
We explore two methods for applying Speech Translation (ST) to subtitling.
arXiv Detail & Related papers (2020-06-01T17:02:28Z)
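Regarding the frame-level formulation mentioned for "Aligning Subtitles in Sign Language Videos" above, the toy model below illustrates the general idea of scoring each video frame against a queried subtitle embedding. The architecture, feature dimensions, and use of PyTorch are illustrative assumptions, not the authors' actual model.

```python
import torch
import torch.nn as nn


class FrameLevelAligner(nn.Module):
    """Toy per-frame aligner: for every video frame, predict whether it belongs
    to the queried subtitle. All dimensions and architecture choices are illustrative."""

    def __init__(self, text_dim=768, video_dim=512, hidden=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)    # project subtitle embedding
        self.video_proj = nn.Linear(video_dim, hidden)  # project per-frame video features
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, subtitle_emb, frame_feats):
        # subtitle_emb: (batch, text_dim), e.g. a BERT sentence embedding
        # frame_feats:  (batch, num_frames, video_dim), e.g. CNN features per frame
        t = self.text_proj(subtitle_emb).unsqueeze(1)        # (batch, 1, hidden)
        v = self.video_proj(frame_feats)                     # (batch, T, hidden)
        t = t.expand(-1, v.size(1), -1)                      # repeat over frames
        logits = self.classifier(torch.cat([t, v], dim=-1))  # (batch, T, 1)
        return logits.squeeze(-1)                            # per-frame logits


# Example with random tensors standing in for real BERT/CNN features.
model = FrameLevelAligner()
subtitle_emb = torch.randn(2, 768)       # 2 queried subtitles
frame_feats = torch.randn(2, 100, 512)   # 100 frames each
frame_probs = torch.sigmoid(model(subtitle_emb, frame_feats))  # (2, 100) per-frame probabilities
```

Thresholding the per-frame probabilities would yield the temporal span assigned to each subtitle; the cited paper's actual alignment model is more involved than this sketch.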
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.