Simultaneous Speech Translation for Live Subtitling: from Delay to Display
- URL: http://arxiv.org/abs/2107.08807v2
- Date: Tue, 20 Jul 2021 09:27:39 GMT
- Title: Simultaneous Speech Translation for Live Subtitling: from Delay to Display
- Authors: Alina Karakanta, Sara Papi, Matteo Negri, Marco Turchi
- Abstract summary: We explore the feasibility of simultaneous speech translation (SimulST) for live subtitling.
We adapt SimulST systems to predict subtitle breaks along with the translation.
We propose a display mode that exploits the predicted break structure by presenting the subtitles in scrolling lines.
- Score: 13.35771688595446
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With the increased audiovisualisation of communication, the need for live
subtitles in multilingual events is more relevant than ever. In an attempt to
automatise the process, we aim to explore the feasibility of simultaneous
speech translation (SimulST) for live subtitling. However, the word-for-word
rate of generation of SimulST systems is not optimal for displaying the
subtitles in a comprehensible and readable way. In this work, we adapt SimulST
systems to predict subtitle breaks along with the translation. We then propose
a display mode that exploits the predicted break structure by presenting the
subtitles in scrolling lines. We compare our proposed mode with displaying the
subtitles 1) word-for-word and 2) in blocks, in terms of reading speed and delay.
Experiments on three language pairs (en$\rightarrow$it, de, fr) show that
scrolling lines is the only mode achieving an acceptable reading speed while
keeping delay close to a 4-second threshold. We argue that simultaneous
translation for readable live subtitles still faces challenges, the main one
being poor translation quality, and propose directions for steering future
research.
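The scrolling-lines display described in the abstract can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' implementation: it assumes the SimulST system emits the break symbols <eol> (line break) and <eob> (block break) along with translated words, keeps at most two lines on screen, and checks reading speed in characters per second; the token stream, timings, and the ~21 cps ceiling are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ScrollingDisplay:
    """Keep at most `max_lines` subtitle lines on screen, scrolling upwards."""
    max_lines: int = 2
    lines: List[str] = field(default_factory=lambda: [""])
    history: List[Tuple[float, str]] = field(default_factory=list)

    def feed(self, token: str, timestamp: float) -> None:
        """Consume one emitted token and the time (in seconds) it was produced."""
        if token in ("<eol>", "<eob>"):
            self.lines.append("")        # a predicted break opens a new line
            if len(self.lines) > self.max_lines:
                self.lines.pop(0)        # scroll: the oldest line leaves the screen
        else:
            sep = " " if self.lines[-1] else ""
            self.lines[-1] += sep + token
        self.history.append((timestamp, self.render()))

    def render(self) -> str:
        return "\n".join(self.lines)


def reading_speed(text: str, display_time: float) -> float:
    """Characters per second the viewer must read; ~21 cps is often cited as a ceiling."""
    return len(text) / max(display_time, 1e-6)


if __name__ == "__main__":
    display = ScrollingDisplay()
    # Hypothetical SimulST output: (token, emission time in seconds).
    stream = [("Welcome", 0.4), ("to", 0.7), ("the", 0.9), ("talk", 1.2), ("<eol>", 1.2),
              ("on", 1.6), ("live", 1.9), ("subtitling", 2.4), ("<eob>", 2.4)]
    for tok, t in stream:
        display.feed(tok, t)
    for t, frame in display.history:
        print(f"t={t:.1f}s | {frame.replace(chr(10), ' / ')}")
    print("reading speed:", round(reading_speed("Welcome to the talk", 2.0), 1), "chars/s")
```

The design intuition is that scrolling one line at a time keeps each line on screen longer than word-for-word display while still following the predicted break structure, which is the trade-off between reading speed and delay examined in the paper.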
Related papers
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart, on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- Direct Speech Translation for Automatic Subtitling [17.095483965591267]
We propose the first direct ST model for automatic subtitling that generates subtitles in the target language along with their timestamps with a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system under the same data conditions.
arXiv Detail & Related papers (2022-09-27T06:47:42Z)
- Anticipation-free Training for Simultaneous Translation [70.85761141178597]
Simultaneous translation (SimulMT) speeds up the translation process by starting to translate before the source sentence is completely available.
Existing methods increase latency or introduce adaptive read-write policies for SimulMT models to handle local reordering and improve translation quality.
We propose a new framework that decomposes the translation process into the monotonic translation step and the reordering step.
arXiv Detail & Related papers (2022-01-30T16:29:37Z)
- SimulSLT: End-to-End Simultaneous Sign Language Translation [55.54237194555432]
Existing sign language translation methods need to read the entire video before starting the translation.
We propose SimulSLT, the first end-to-end simultaneous sign language translation model.
SimulSLT achieves BLEU scores that exceed those of the latest end-to-end non-simultaneous sign language translation model.
arXiv Detail & Related papers (2021-12-08T11:04:52Z)
- Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not.
arXiv Detail & Related papers (2021-05-06T17:59:36Z)
- Presenting Simultaneous Translation in Limited Space [0.0]
Some methods of automatic simultaneous translation of long-form speech allow revisions of outputs, trading accuracy for low latency.
Subtitling must be shown promptly, incrementally, and with adequate time for reading.
We propose a way to estimate the overall usability of the combination of automatic translation and subtitling by measuring quality, latency, and stability on a test set.
arXiv Detail & Related papers (2020-09-18T18:37:03Z)
- Is 42 the Answer to Everything in Subtitling-oriented Speech Translation? [16.070428245677675]
Subtitling is becoming increasingly important for disseminating information.
We explore two methods for applying Speech Translation (ST) to subtitling.
arXiv Detail & Related papers (2020-06-01T17:02:28Z)
- MuST-Cinema: a Speech-to-Subtitles corpus [16.070428245677675]
We present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
We show that the corpus can be used to build models that efficiently segment sentences into subtitles.
We propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to a length constraint (see the segmentation sketch after this list).
arXiv Detail & Related papers (2020-02-25T12:40:06Z)
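For the subtitle-break annotation mentioned in the MuST-Cinema entry, the sketch below illustrates length-constrained segmentation. It is an illustration only: it assumes the <eol>/<eob> break symbols introduced by MuST-Cinema, a 42-characters-per-line limit (the value referenced in the "Is 42 the Answer..." title), and a simple greedy packing strategy rather than the corpus' actual annotation procedure.

```python
from typing import List

MAX_CHARS_PER_LINE = 42   # common characters-per-line (CPL) subtitling guideline
MAX_LINES_PER_BLOCK = 2   # at most two lines displayed per subtitle block


def insert_breaks(sentence: str) -> str:
    """Greedily pack words into lines and mark <eol>/<eob> breaks."""
    lines: List[str] = []
    current = ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= MAX_CHARS_PER_LINE:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)

    # Re-join, inserting <eol> inside a block and <eob> after every full block.
    out: List[str] = []
    for i, line in enumerate(lines, start=1):
        out.append(line)
        if i < len(lines):
            out.append("<eob>" if i % MAX_LINES_PER_BLOCK == 0 else "<eol>")
    out.append("<eob>")  # the end of a sentence always closes a block
    return " ".join(out)


if __name__ == "__main__":
    print(insert_breaks(
        "With the increased audiovisualisation of communication the need for "
        "live subtitles in multilingual events is more relevant than ever"
    ))
```

A segmenter of this kind produces the break-annotated target text that a SimulST model can be trained to predict jointly with the translation, as described in the abstract above.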