Between Flexibility and Consistency: Joint Generation of Captions and Subtitles
- URL: http://arxiv.org/abs/2107.06246v1
- Date: Tue, 13 Jul 2021 17:06:04 GMT
- Title: Between Flexibility and Consistency: Joint Generation of Captions and Subtitles
- Authors: Alina Karakanta, Marco Gaido, Matteo Negri, Marco Turchi
- Abstract summary: Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing.
In this work, we focus on ST models which generate consistent captions-subtitles in terms of structure and lexical content.
Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.
- Score: 13.58711830450618
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speech translation (ST) has lately received growing interest for the
generation of subtitles without the need for an intermediate source language
transcription and timing (i.e. captions). However, the joint generation of
source captions and target subtitles not only brings potential output-quality
advantages when the two decoding processes inform each other, but is also
often required in multilingual scenarios. In this work, we focus on ST
models which generate consistent captions-subtitles in terms of structure and
lexical content. We further introduce new metrics for evaluating subtitling
consistency. Our findings show that joint decoding leads to increased
performance and consistency between the generated captions and subtitles while
still allowing for sufficient flexibility to produce subtitles conforming to
language-specific needs and norms.
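A minimal sketch of what checking caption-subtitle consistency could look like is given below. It does not reproduce the metrics introduced in the paper: the function names, the <eob> end-of-block markers and the simple overlap measures are assumptions made for illustration.

```python
def structural_consistency(caption: str, subtitle: str) -> float:
    """Illustrative check: do caption and subtitle contain the same number of
    subtitle blocks?  Blocks are assumed to be delimited by <eob> markers,
    as in MuST-Cinema-style annotation."""
    cap_blocks = caption.count("<eob>")
    sub_blocks = subtitle.count("<eob>")
    if max(cap_blocks, sub_blocks) == 0:
        return 1.0
    return min(cap_blocks, sub_blocks) / max(cap_blocks, sub_blocks)


def lexical_consistency(caption: str, subtitle: str) -> float:
    """Illustrative proxy: Jaccard overlap between caption and subtitle tokens.
    In a cross-lingual setting the subtitle would first need to be word-aligned
    (or translated back) to the caption language; that step is omitted here."""
    cap_tokens = set(caption.lower().split()) - {"<eob>", "<eol>"}
    sub_tokens = set(subtitle.lower().split()) - {"<eob>", "<eol>"}
    if not cap_tokens or not sub_tokens:
        return 0.0
    return len(cap_tokens & sub_tokens) / len(cap_tokens | sub_tokens)


if __name__ == "__main__":
    cap = "I have a dream today. <eob>"
    sub = "I have a dream today, friends. <eob>"
    print(structural_consistency(cap, sub))  # 1.0: same number of blocks
    print(lexical_consistency(cap, sub))     # partial token overlap
```

The structural check mirrors the intuition that consistent captions and subtitles should be split into the same number of blocks; the lexical check is only a rough stand-in for the alignment-based measures a cross-lingual evaluation would require.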
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content and shows limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that takes into account a longer span of subtitle text, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Direct Speech Translation for Automatic Subtitling [17.095483965591267]
We propose the first direct ST model for automatic subtitling that generates subtitles in the target language along with their timestamps with a single model.
Our experiments on 7 language pairs show that our approach outperforms a cascade system in the same data condition.
arXiv Detail & Related papers (2022-09-27T06:47:42Z)
- Dodging the Data Bottleneck: Automatic Subtitling with Automatically Segmented ST Corpora [15.084508754409848]
Speech translation for subtitling (SubST) is the task of automatically translating speech data into well-formed subtitles.
We propose a method to convert existing ST corpora into SubST resources without human intervention.
We build a segmenter model that automatically segments texts into proper subtitles by exploiting audio and text in a multimodal fashion.
arXiv Detail & Related papers (2022-09-21T19:06:36Z)
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce the new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Aligning Subtitles in Sign Language Videos [80.20961722170655]
We train on manually annotated alignments covering over 15K subtitles that span 17.7 hours of video.
We use BERT subtitle embeddings and CNN video representations learned for sign recognition to encode the two signals.
Our model outputs frame-level predictions, i.e., for each video frame, whether it belongs to the queried subtitle or not.
arXiv Detail & Related papers (2021-05-06T17:59:36Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
This effectively avoids the degeneration of predicting masked words conditioned only on the context of their own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Is 42 the Answer to Everything in Subtitling-oriented Speech Translation? [16.070428245677675]
Subtitling is becoming increasingly important for disseminating information.
We explore two methods for applying Speech Translation (ST) to subtitling.
arXiv Detail & Related papers (2020-06-01T17:02:28Z)
- MuST-Cinema: a Speech-to-Subtitles corpus [16.070428245677675]
We present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
We show that the corpus can be used to build models that efficiently segment sentences into subtitles.
We propose a method for annotating existing subtitling corpora with subtitle breaks that conform to the subtitle length constraint; a minimal length-based segmentation sketch follows this list.
arXiv Detail & Related papers (2020-02-25T12:40:06Z)
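To make the length constraint mentioned for MuST-Cinema concrete, the sketch below greedily inserts break markers so that no subtitle line exceeds a character limit. It is not the segmenter described in the papers above (those also exploit audio and multimodal cues); the 42-character default, the <eol> marker and the function name are illustrative assumptions.

```python
def insert_subtitle_breaks(sentence: str, max_chars: int = 42) -> str:
    """Greedily insert <eol> (end-of-line) markers so that no subtitle line
    exceeds max_chars characters.  Illustrative only: real segmenters also use
    prosodic, syntactic and multimodal cues, not just length."""
    lines, current = [], []
    for word in sentence.split():
        candidate = " ".join(current + [word])
        if current and len(candidate) > max_chars:
            lines.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))
    return " <eol> ".join(lines)


print(insert_subtitle_breaks(
    "Speech translation has lately received growing interest for the "
    "generation of subtitles without an intermediate transcription step."
))
```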