SBAAM! Eliminating Transcript Dependency in Automatic Subtitling
- URL: http://arxiv.org/abs/2405.10741v1
- Date: Fri, 17 May 2024 12:42:56 GMT
- Title: SBAAM! Eliminating Transcript Dependency in Automatic Subtitling
- Authors: Marco Gaido, Sara Papi, Matteo Negri, Mauro Cettolo, Luisa Bentivogli
- Abstract summary: Subtitling plays a crucial role in enhancing the accessibility of audiovisual content.
Past attempts to automate this process rely, to varying degrees, on automatic transcripts.
We introduce the first direct model capable of producing automatic subtitles.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subtitling plays a crucial role in enhancing the accessibility of audiovisual content and encompasses three primary subtasks: translating spoken dialogue, segmenting translations into concise textual units, and estimating timestamps that govern their on-screen duration. Past attempts to automate this process rely, to varying degrees, on automatic transcripts, employed diversely for the three subtasks. In response to the acknowledged limitations associated with this reliance on transcripts, recent research has shifted towards transcription-free solutions for translation and segmentation, leaving the direct generation of timestamps as uncharted territory. To fill this gap, we introduce the first direct model capable of producing automatic subtitles, entirely eliminating any dependence on intermediate transcripts also for timestamp prediction. Experimental results, backed by manual evaluation, showcase our solution's new state-of-the-art performance across multiple language pairs and diverse conditions.
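To make the three subtasks concrete: a subtitle file such as SubRip (SRT) pairs each translated, segmented text block with the start and end timestamps that govern its on-screen duration. The sketch below is purely illustrative of the output format, not of the paper's model; the `Cue` class and helper names are assumptions introduced here.

```python
from dataclasses import dataclass

@dataclass
class Cue:
    """One subtitle block: segmented translated text plus its on-screen time span."""
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str     # may contain line breaks (segmentation into concise units)

def fmt_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 00:00:01,500."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues: list[Cue]) -> str:
    """Serialize cues in SubRip (SRT) format."""
    blocks = [
        f"{c.index}\n{fmt_ts(c.start)} --> {fmt_ts(c.end)}\n{c.text}"
        for c in cues
    ]
    return "\n\n".join(blocks)

# An English utterance translated to French, segmented into two lines,
# and assigned a 1.75-second on-screen duration:
cue = Cue(1, 1.5, 3.25, "Bonjour !\nComment ça va ?")
print(to_srt([cue]))
```

A direct subtitling model must produce all three components (translation, segmentation, timing) from the audio; the contribution here is doing so without an intermediate transcript, including for the timestamp prediction step.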
Related papers
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- Long-Form End-to-End Speech Translation via Latent Alignment Segmentation [6.153530338207679]
Current simultaneous speech translation models can process audio only up to a few seconds long.
We propose a novel segmentation approach for a low-latency end-to-end speech translation.
We show that the proposed approach achieves state-of-the-art quality at no additional computational cost.
arXiv Detail & Related papers (2023-09-20T15:10:12Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Direct Speech Translation for Automatic Subtitling [17.095483965591267]
We propose the first direct ST model for automatic subtitling that generates subtitles in the target language along with their timestamps with a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition.
arXiv Detail & Related papers (2022-09-27T06:47:42Z)
- Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora [15.084508754409848]
Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles.
We propose a method to convert existing ST corpora into SubST resources without human intervention.
We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion.
arXiv Detail & Related papers (2022-09-21T19:06:36Z)
- Punctuation Restoration [69.97278287534157]
This work presents a new human-annotated corpus, called BehancePR, for punctuation restoration in livestreaming video transcripts.
Our experiments on BehancePR demonstrate the challenges of punctuation restoration for this domain.
arXiv Detail & Related papers (2022-02-19T23:12:57Z)
- Machine Translation Verbosity Control for Automatic Dubbing [11.85772502779967]
We propose new methods to control the verbosity of machine translation output.
For experiments we use a public data set to dub English speeches into French, Italian, German and Spanish.
We report extensive subjective tests that measure the impact of MT verbosity control on the final quality of dubbed video clips.
arXiv Detail & Related papers (2021-10-08T01:19:10Z)
- SimulLR: Simultaneous Lip Reading Transducer with Attention-Guided Adaptive Memory [61.44510300515693]
We study the task of simultaneous lip reading and devise SimulLR, a simultaneous lip reading transducer with attention-guided adaptive memory.
The experiments show that SimulLR achieves a translation speedup of 9.10 times compared with state-of-the-art non-simultaneous methods.
arXiv Detail & Related papers (2021-08-31T05:54:16Z)
- A Sliding-Window Approach to Automatic Creation of Meeting Minutes [66.39584679676817]
Meeting minutes record the subjects discussed, decisions reached, and actions taken at meetings.
We present a sliding window approach to automatic generation of meeting minutes.
It aims to tackle issues associated with the nature of spoken text, including lengthy transcripts and lack of document structure.
arXiv Detail & Related papers (2021-04-26T02:44:14Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Consistent Transcription and Translation of Speech [13.652411093089947]
We explore the task of jointly transcribing and translating speech.
While high accuracy of transcript and translation are crucial, even highly accurate systems can suffer from inconsistencies between both outputs.
We find that direct models are poorly suited to the joint transcription/translation task, but that end-to-end models that feature a coupled inference procedure are able to achieve strong consistency.
arXiv Detail & Related papers (2020-07-24T19:17:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.