Is 42 the Answer to Everything in Subtitling-oriented Speech Translation?
- URL: http://arxiv.org/abs/2006.01080v1
- Date: Mon, 1 Jun 2020 17:02:28 GMT
- Title: Is 42 the Answer to Everything in Subtitling-oriented Speech Translation?
- Authors: Alina Karakanta, Matteo Negri, Marco Turchi
- Abstract summary: Subtitling is becoming increasingly important for disseminating information.
We explore two methods for applying Speech Translation (ST) to subtitling.
- Score: 16.070428245677675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subtitling is becoming increasingly important for disseminating information,
given the enormous amounts of audiovisual content becoming available daily.
Although Neural Machine Translation (NMT) can speed up the process of
translating audiovisual content, large manual effort is still required for
transcribing the source language, and for spotting and segmenting the text into
proper subtitles. Creating proper subtitles, in terms of timing and segmentation,
depends heavily on information present in the audio (utterance duration, natural
pauses). In this work, we explore two methods for applying Speech Translation
(ST) to subtitling: a) a direct end-to-end and b) a classical cascade approach.
We discuss the benefit of having access to the source language speech for
improving the conformity of the generated subtitles to the spatial and temporal
subtitling constraints and show that length is not the answer to everything in
the case of subtitling-oriented ST.
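As a rough illustration of the spatial and temporal constraints mentioned above, the sketch below (not code from the paper) checks whether a generated subtitle respects a maximum line length of 42 characters, at most two lines, and a maximum reading speed of 21 characters per second. These thresholds follow common TED-style subtitling guidelines and are assumptions made for the example, not values fixed by this work.

```python
# Minimal sketch of checking a subtitle against typical spatial and temporal
# subtitling constraints. The thresholds (42 characters per line, 2 lines,
# 21 characters per second) are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Subtitle:
    lines: list[str]   # text lines displayed together on screen
    start: float       # display start time in seconds
    end: float         # display end time in seconds

def conforms(sub: Subtitle,
             max_chars_per_line: int = 42,
             max_lines: int = 2,
             max_chars_per_second: float = 21.0) -> bool:
    """Return True if the subtitle meets the spatial and temporal constraints."""
    # Spatial constraints: number of lines and characters per line.
    if len(sub.lines) > max_lines:
        return False
    if any(len(line) > max_chars_per_line for line in sub.lines):
        return False
    # Temporal constraint: reading speed in characters per second.
    duration = sub.end - sub.start
    if duration <= 0:
        return False
    total_chars = sum(len(line) for line in sub.lines)
    return total_chars / duration <= max_chars_per_second

# Example: a two-line subtitle displayed for 3 seconds (20 chars/second) passes.
print(conforms(Subtitle(["Creating proper subtitles depends on",
                         "timing and segmentation."], 0.0, 3.0)))
```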
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent video content in the foreground, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take into account longer subtitle text, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Exploring the Role of Audio in Video Captioning [59.679122191706426]
We present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning.
We propose new local-global fusion mechanisms to improve information exchange across audio and video.
arXiv Detail & Related papers (2023-06-21T20:54:52Z)
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition [51.412413996510814]
We propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks.
MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2.
arXiv Detail & Related papers (2023-03-09T14:58:29Z)
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into the speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Direct Speech Translation for Automatic Subtitling [17.095483965591267]
We propose the first direct ST model for automatic subtitling, which generates subtitles in the target language along with their timestamps using a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition.
arXiv Detail & Related papers (2022-09-27T06:47:42Z)
- Simultaneous Speech Translation for Live Subtitling: from Delay to Display [13.35771688595446]
We explore the feasibility of simultaneous speech translation (SimulST) for live subtitling.
We adapt SimulST systems to predict subtitle breaks along with the translation.
We propose a display mode that exploits the predicted break structure by presenting the subtitles in scrolling lines.
arXiv Detail & Related papers (2021-07-19T12:35:49Z)
- Between Flexibility and Consistency: Joint Generation of Captions and Subtitles [13.58711830450618]
Speech translation (ST) has recently received growing interest for generating subtitles without the need for an intermediate source-language transcription and timing.
In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content.
Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.
arXiv Detail & Related papers (2021-07-13T17:06:04Z)
- MuST-Cinema: a Speech-to-Subtitles corpus [16.070428245677675]
We present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
We show that the corpus can be used to build models that efficiently segment sentences into subtitles.
We propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to the constraint of length (see the break-insertion sketch after this list).
arXiv Detail & Related papers (2020-02-25T12:40:06Z)
- From Speech-to-Speech Translation to Automatic Dubbing [28.95595497865406]
We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing.
Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, and neural text-to-speech with fine-tuning of the duration of each utterance.
arXiv Detail & Related papers (2020-01-19T07:03:05Z)
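The MuST-Cinema entry above mentions annotating subtitles with break symbols under a length constraint. As a simple illustration (a greedy heuristic sketched here, not the corpus annotation procedure or any model from these papers), the snippet below inserts <eol> (line break) and <eob> (end-of-block) markers so that no subtitle line exceeds a character limit; the default 42-character limit mirrors the constraint referenced in the title of this page.

```python
# Naive greedy sketch (an illustration only) of inserting subtitle break symbols
# into a translated sentence so that no line exceeds a length limit.
# <eol> marks a line break inside a subtitle block; <eob> marks the end of a block.

def add_breaks(sentence: str, max_chars: int = 42, lines_per_block: int = 2) -> str:
    words = sentence.split()
    out, line_len, lines_in_block = [], 0, 1
    for word in words:
        needed = len(word) if line_len == 0 else line_len + 1 + len(word)
        if line_len > 0 and needed > max_chars:
            # Close the current line; start either a new line or a new block.
            if lines_in_block < lines_per_block:
                out.append("<eol>")
                lines_in_block += 1
            else:
                out.append("<eob>")
                lines_in_block = 1
            line_len = len(word)
        else:
            line_len = needed
        out.append(word)
    out.append("<eob>")  # close the final block
    return " ".join(out)

# Example: segment a long sentence into 42-character subtitle lines.
print(add_breaks("Subtitling is becoming increasingly important "
                 "for disseminating information given the enormous "
                 "amounts of audiovisual content becoming available daily."))
```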
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.