Dubbing in Practice: A Large Scale Study of Human Localization With
Insights for Automatic Dubbing
- URL: http://arxiv.org/abs/2212.12137v1
- Date: Fri, 23 Dec 2022 04:12:52 GMT
- Title: Dubbing in Practice: A Large Scale Study of Human Localization With
Insights for Automatic Dubbing
- Authors: William Brannon, Yogesh Virkar, Brian Thompson
- Abstract summary: We investigate how humans perform the task of dubbing video content from one language into another.
We leverage a novel corpus of 319.57 hours of video from 54 professionally produced titles.
- Score: 6.26764826816895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate how humans perform the task of dubbing video content from one
language into another, leveraging a novel corpus of 319.57 hours of video from
54 professionally produced titles. This is the first such large-scale study we
are aware of. The results challenge a number of assumptions commonly made in
both qualitative literature on human dubbing and machine-learning literature on
automatic dubbing, arguing for the importance of vocal naturalness and
translation quality over commonly emphasized isometric (character length) and
lip-sync constraints, and for a more qualified view of the importance of
isochronic (timing) constraints. We also find substantial influence of the
source-side audio on human dubs through channels other than the words of the
translation, pointing to the need for research on ways to preserve speech
characteristics, as well as semantic transfer such as emphasis/emotion, in
automatic dubbing systems.
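In the machine-learning literature, the constraints discussed above are usually quantified as simple target/source ratios per dialogue line: an isometric ratio over character counts and an isochronic ratio over speech durations. The sketch below is a minimal illustration of that distinction, not anything from the paper; the Segment fields and the example values are assumptions for demonstration.

```python
# Minimal sketch (not from the paper): the two constraints discussed above,
# measured as simple target/source ratios per dialogue line.
# The Segment fields and the example values are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Segment:
    text: str     # transcript (source) or translation (target) of one line
    start: float  # speech onset in seconds
    end: float    # speech offset in seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def isometric_ratio(src: Segment, tgt: Segment) -> float:
    """Character-length ratio of target text to source text."""
    return len(tgt.text) / max(len(src.text), 1)


def isochronic_ratio(src: Segment, tgt: Segment) -> float:
    """Speech-duration ratio of target line to source line."""
    return tgt.duration / max(src.duration, 1e-6)


# Example: a dub that is longer in characters but still fits the original timing.
src = Segment("How are you doing today?", start=12.0, end=13.4)
tgt = Segment("¿Cómo te encuentras el día de hoy?", start=12.0, end=13.5)
print(f"isometric:  {isometric_ratio(src, tgt):.2f}")
print(f"isochronic: {isochronic_ratio(src, tgt):.2f}")
```

Per the abstract above, human dubbers let the first (character-length) ratio drift in favor of naturalness and translation quality, and treat the second (timing) ratio as important but not absolute.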
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs the facial motion of the lip region from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into the speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, its length/duration should be as close as possible to that of the original speech.
We propose a machine translation system tailored to the task of video dubbing, which directly considers the speech duration of each token during translation (a minimal sketch of this duration-matching idea appears after this list).
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Automatic dense annotation of large-vocabulary sign language videos [85.61513254261523]
We propose a simple, scalable framework to vastly increase the density of automatic annotations.
We make these annotations publicly available to support the sign language research community.
arXiv Detail & Related papers (2022-08-04T17:55:09Z)
- Prosodic Alignment for off-screen automatic dubbing [17.7813193467431]
The goal of automatic dubbing is to perform speech-to-speech translation while achieving audiovisual coherence.
This entails isochrony, i.e., translating the original speech while also matching how it is structured into phrases and pauses.
We extend the prosodic alignment model to address off-screen dubbing that requires less stringent synchronization constraints.
arXiv Detail & Related papers (2022-04-06T01:02:58Z)
- Machine Translation Verbosity Control for Automatic Dubbing [11.85772502779967]
We propose new methods to control the verbosity of machine translation output.
For experiments we use a public data set to dub English speeches into French, Italian, German and Spanish.
We report extensive subjective tests that measure the impact of MT verbosity control on the final quality of dubbed video clips.
arXiv Detail & Related papers (2021-10-08T01:19:10Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Is 42 the Answer to Everything in Subtitling-oriented Speech Translation? [16.070428245677675]
Subtitling is becoming increasingly important for disseminating information.
We explore two methods for applying Speech Translation (ST) to subtitling.
arXiv Detail & Related papers (2020-06-01T17:02:28Z)
- MuST-Cinema: a Speech-to-Subtitles corpus [16.070428245677675]
We present MuST-Cinema, a multilingual speech translation corpus built from TED subtitles.
We show that the corpus can be used to build models that efficiently segment sentences into subtitles.
We propose a method for annotating existing subtitling corpora with subtitle breaks that conform to the length constraint.
arXiv Detail & Related papers (2020-02-25T12:40:06Z)
- From Speech-to-Speech Translation to Automatic Dubbing [28.95595497865406]
We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing.
Our architecture features neural machine translation that generates output of the preferred length, prosodic alignment of the translation with the original speech segments, and neural text-to-speech with fine-tuning of the duration of each utterance.
arXiv Detail & Related papers (2020-01-19T07:03:05Z)
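Several entries above (VideoDubber, the verbosity-control work, the prosodic-alignment work, and the speech-to-speech-to-dubbing pipeline) share one practical step: among otherwise acceptable translations, prefer the one whose spoken duration best matches the source segment. The sketch below is a hypothetical illustration of that selection step only; the crude per-character speaking-rate model and all names are assumptions, not any of the papers' actual methods (VideoDubber, for instance, models duration per token inside the MT model).

```python
# Hypothetical sketch: choose, from n-best MT candidates, the translation whose
# estimated spoken duration is closest to the source segment's duration.
# The per-character speaking-rate constant is an assumption for illustration;
# real systems model duration per token or adjust TTS speaking rate instead.
CHARS_PER_SECOND = 15.0  # assumed average speaking rate for the target language


def estimated_duration(text: str) -> float:
    """Very rough spoken-duration estimate from character count."""
    return len(text) / CHARS_PER_SECOND


def pick_isochronic_candidate(candidates: list[str], source_duration: float) -> str:
    """Return the candidate whose estimated duration best matches the source."""
    return min(candidates, key=lambda c: abs(estimated_duration(c) - source_duration))


# Example: three paraphrases of one line; the source speech lasted 1.4 seconds.
nbest = [
    "Thank you so much for everything you have done for us.",
    "Thanks a lot for everything.",
    "Thanks for everything.",
]
print(pick_isochronic_candidate(nbest, source_duration=1.4))  # -> "Thanks for everything."
```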
This list is automatically generated from the titles and abstracts of the papers on this site.