Direct Speech Translation for Automatic Subtitling
- URL: http://arxiv.org/abs/2209.13192v2
- Date: Tue, 25 Jul 2023 18:12:16 GMT
- Title: Direct Speech Translation for Automatic Subtitling
- Authors: Sara Papi, Marco Gaido, Alina Karakanta, Mauro Cettolo, Matteo Negri,
Marco Turchi
- Abstract summary: We propose the first direct speech translation (ST) model for automatic subtitling that generates subtitles in the target language along with their timestamps using a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition.
- Score: 17.095483965591267
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Automatic subtitling is the task of automatically translating the speech of
audiovisual content into short pieces of timed text, i.e. subtitles and their
corresponding timestamps. The generated subtitles need to conform to space and
time requirements, while being synchronised with the speech and segmented in a
way that facilitates comprehension. Given its considerable complexity, the task
has so far been addressed through a pipeline of components that separately deal
with transcribing, translating, and segmenting text into subtitles, as well as
predicting timestamps. In this paper, we propose the first direct ST model for
automatic subtitling that generates subtitles in the target language along with
their timestamps using a single model. Our experiments on 7 language pairs show
that our approach outperforms a cascade system in the same data condition, while
also being competitive with production tools on both in-domain and newly
released out-of-domain benchmarks covering new scenarios.
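In practice, the "space and time requirements" above are limits on characters per line, lines per block, and reading speed, and the timed output is typically delivered in a format such as SRT. The sketch below shows what conformity checking and SRT serialization can look like; the constraint values (42 characters per line, 2 lines per block, 21 characters per second) and the helper names are common-guideline assumptions made here, not figures or code from the paper.

```python
# Minimal sketch (not the paper's code) of well-formed, timed subtitles:
# each block carries a start/end time and must respect typical spatial and
# temporal constraints. The limits below are commonly used guidelines.
from dataclasses import dataclass

MAX_CHARS_PER_LINE = 42
MAX_LINES_PER_BLOCK = 2
MAX_CHARS_PER_SECOND = 21.0


@dataclass
class Subtitle:
    start: float      # seconds
    end: float        # seconds
    lines: list[str]  # one or two lines of display text

    def conforms(self) -> bool:
        """Check the usual space (length) and time (reading speed) constraints."""
        duration = self.end - self.start
        n_chars = sum(len(line) for line in self.lines)
        return (
            len(self.lines) <= MAX_LINES_PER_BLOCK
            and all(len(line) <= MAX_CHARS_PER_LINE for line in self.lines)
            and duration > 0
            and n_chars / duration <= MAX_CHARS_PER_SECOND
        )


def to_srt(subtitles: list[Subtitle]) -> str:
    """Serialize timed subtitle blocks in the standard SRT format."""
    def ts(t: float) -> str:
        total_ms = round(t * 1000)
        h, rem = divmod(total_ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, sub in enumerate(subtitles, start=1):
        blocks.append(f"{i}\n{ts(sub.start)} --> {ts(sub.end)}\n" + "\n".join(sub.lines))
    return "\n\n".join(blocks)


subs = [Subtitle(0.0, 2.4, ["Bonjour à tous,", "merci d'être venus."])]
print(all(s.conforms() for s in subs))  # True
print(to_srt(subs))
```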
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- SBAAM! Eliminating Transcript Dependency in Automatic Subtitling [23.444615994847947]
Subtitling plays a crucial role in enhancing the accessibility of audiovisual content.
Past attempts to automate this process rely, to varying degrees, on automatic transcripts.
We introduce the first direct model capable of producing automatic subtitles.
arXiv Detail & Related papers (2024-05-17T12:42:56Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take into account longer subtitle context, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
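A minimal sketch of that windowing idea, assuming a hypothetical prompt wording and window size (the paper's actual prompt and pipeline are not reproduced here): consecutive ASR subtitles are grouped so that the LLM sees context beyond a single sentence.

```python
# Illustrative sketch of prompting an LLM with a window of consecutive subtitles
# rather than one sentence at a time. The prompt text and window size are
# assumptions for illustration, not the paper's actual prompt.

def build_caption_prompts(subtitles: list[tuple[float, str]], window: int = 8) -> list[str]:
    """Group consecutive (start_time, text) subtitles into fixed-size windows and
    turn each window into one prompt asking for aligned, per-timestamp captions."""
    header = (
        "The following are consecutive ASR subtitles from an instructional video.\n"
        "Rewrite them as short, self-contained captions describing what is shown,\n"
        "keeping one caption per timestamp.\n\n"
    )
    prompts = []
    for i in range(0, len(subtitles), window):
        chunk = subtitles[i:i + window]
        lines = "\n".join(f"[{start:7.2f}s] {text}" for start, text in chunk)
        prompts.append(header + lines)
    return prompts


# Example usage; in a real pipeline each prompt would be sent to an LLM.
demo = [(12.3, "so first you want to chop the onions"),
        (15.1, "then add them to the hot pan")]
for prompt in build_caption_prompts(demo):
    print(prompt)
```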
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters [71.02335065794384]
We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences.
We show that our model improves translation quality and isochrony compared to previous work.
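As a toy illustration of why jointly predicted durations are useful (this is not the paper's model; names and example values are assumptions), the total duration implied by decoded phoneme-duration pairs can be compared against the source segment to score isochrony:

```python
# Toy illustration of the target-factor idea: each decoded phoneme carries a
# predicted duration, so a hypothesis can be scored for isochrony against the
# source segment. Values and tolerance are illustrative assumptions.

def isochrony_gap(phoneme_durations: list[tuple[str, float]], source_duration: float) -> float:
    """Relative mismatch between predicted target speech duration and the source."""
    target_duration = sum(d for _, d in phoneme_durations)
    return abs(target_duration - source_duration) / source_duration


# One decoded hypothesis: phonemes with their jointly predicted durations (seconds).
hyp = [("b", 0.06), ("o", 0.09), ("n", 0.07), ("zh", 0.08), ("u", 0.10), ("r", 0.09)]
print(f"isochrony gap: {isochrony_gap(hyp, source_duration=0.52):.1%}")  # ~5.8%
```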
arXiv Detail & Related papers (2023-05-22T16:36:04Z)
- Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing [71.02335065794384]
We propose a model that directly optimizes both the translation and the speech duration of the generated translations.
We show that this system generates speech that better matches the timing of the original speech, compared to prior work, while simplifying the system architecture.
arXiv Detail & Related papers (2023-02-25T04:23:25Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora [15.084508754409848]
Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles.
We propose a method to convert existing ST corpora into SubST resources without human intervention.
We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion.
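For intuition, a purely text-based, greedy character-budget segmenter is sketched below; it is an assumption-level illustration of the segmentation task only, not the learned multimodal (audio + text) segmenter the entry describes.

```python
# Minimal, text-only sketch of the segmentation step: splitting translated text
# into subtitle lines under a length limit. The 42-character budget is a common
# guideline assumed here; the paper's segmenter is a learned multimodal model.

def greedy_segment(text: str, max_chars: int = 42) -> list[str]:
    """Greedily pack words into subtitle lines of at most `max_chars` characters."""
    lines, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


print(greedy_segment("the generated subtitles need to conform to space and time requirements"))
# ['the generated subtitles need to conform to', 'space and time requirements']
```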
arXiv Detail & Related papers (2022-09-21T19:06:36Z)
- Between Flexibility and Consistency: Joint Generation of Captions and Subtitles [13.58711830450618]
Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing.
In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content.
Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.
arXiv Detail & Related papers (2021-07-13T17:06:04Z)
- Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not.
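A hedged sketch of that frame-level formulation, assuming arbitrary feature dimensions and a simple MLP scorer rather than the authors' architecture: one subtitle embedding is broadcast over the per-frame video features, and each frame receives a probability of belonging to the queried subtitle.

```python
# Sketch (not the authors' model) of frame-level alignment: fuse a subtitle
# embedding (e.g., from BERT) with per-frame video features (e.g., from a
# sign-recognition CNN) and classify every frame. Dimensions are assumptions.
import torch
import torch.nn as nn


class FrameAligner(nn.Module):
    def __init__(self, text_dim: int = 768, video_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(text_dim + video_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, subtitle_emb: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # subtitle_emb: (batch, text_dim); frame_feats: (batch, n_frames, video_dim)
        n_frames = frame_feats.size(1)
        text = subtitle_emb.unsqueeze(1).expand(-1, n_frames, -1)  # broadcast over frames
        logits = self.scorer(torch.cat([text, frame_feats], dim=-1)).squeeze(-1)
        return torch.sigmoid(logits)  # per-frame probability for the queried subtitle


model = FrameAligner()
probs = model(torch.randn(2, 768), torch.randn(2, 100, 512))
print(probs.shape)  # torch.Size([2, 100])
```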
arXiv Detail & Related papers (2021-05-06T17:59:36Z)
- MuST-Cinema: a Speech-to-Subtitles corpus [16.070428245677675]
We present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
We show that the corpus can be used to build models that efficiently segment sentences into subtitles.
We propose a method for annotating existing subtitling corpora with subtitle breaks, conforming to length constraints.
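A small illustration of how break-annotated text of this kind maps back to displayable subtitles; MuST-Cinema marks breaks with special tokens (a line break inside a subtitle versus the end of a subtitle block, here assumed to be <eol> and <eob>), and the parser below is an illustrative sketch rather than the corpus's official tooling.

```python
# Sketch of turning break-annotated text back into subtitle blocks and lines.
# The <eol>/<eob> tokens are assumed break markers for illustration.

def parse_breaks(annotated: str) -> list[list[str]]:
    """Return a list of subtitle blocks, each a list of display lines."""
    blocks = []
    for raw_block in annotated.split("<eob>"):
        lines = [line.strip() for line in raw_block.split("<eol>") if line.strip()]
        if lines:
            blocks.append(lines)
    return blocks


example = "I remember the first time <eol> I saw the sea. <eob> It was summer. <eob>"
print(parse_breaks(example))
# [['I remember the first time', 'I saw the sea.'], ['It was summer.']]
```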
arXiv Detail & Related papers (2020-02-25T12:40:06Z)