Length Aware Speech Translation for Video Dubbing
- URL: http://arxiv.org/abs/2506.00740v1
- Date: Sat, 31 May 2025 23:01:50 GMT
- Title: Length Aware Speech Translation for Video Dubbing
- Authors: Harveen Singh Chadha, Aswin Shanmugam Subramanian, Vikas Joshi, Shubham Bansal, Jian Xue, Rupeshkumar Mehta, Jinyu Li
- Abstract summary: We develop a phoneme-based end-to-end length-sensitive speech translation model, which generates translations of varying lengths: short, normal, and long. We also introduce length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass.
- Score: 27.946422755130868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In video dubbing, aligning translated audio with the source audio is a significant challenge. Our focus is on achieving this efficiently, tailored for real-time, on-device video dubbing scenarios. We developed a phoneme-based end-to-end length-sensitive speech translation (LSST) model, which generates translations of varying lengths (short, normal, and long) using predefined tags. Additionally, we introduced length-aware beam search (LABS), an efficient approach to generate translations of different lengths in a single decoding pass. This approach maintained BLEU scores comparable to a baseline without length awareness while significantly enhancing synchronization quality between source and target audio, achieving mean opinion score (MOS) gains of 0.34 for Spanish and 0.65 for Korean, respectively.
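The abstract only names the mechanism, so here is a minimal, hypothetical Python sketch of the single-pass idea behind LABS: run one beam search and route each finished hypothesis into a short, normal, or long bucket by its length ratio to the source, keeping the best-scoring hypothesis per bucket. The toy decoder, beam settings, and 0.9/1.1 thresholds are all assumptions, not the authors' implementation.

```python
# Hypothetical sketch of length-aware beam search (LABS): a single beam
# search pass whose finished hypotheses are bucketed into short / normal /
# long by their length ratio to the source. step_logprobs() is a toy
# stand-in for the real decoder; the 0.9/1.1 thresholds are guesses.
import math

EOS = 0
VOCAB = 20
BEAM = 8
MAX_LEN = 30

def step_logprobs(prefix):
    """Toy decoder: log-probability for each candidate next token."""
    scores = [-(abs(tok - len(prefix)) + 1.0) for tok in range(VOCAB)]
    z = math.log(sum(math.exp(s) for s in scores))
    return [s - z for s in scores]

def bucket(hyp_len, src_len):
    ratio = hyp_len / src_len
    return "short" if ratio < 0.9 else "long" if ratio > 1.1 else "normal"

def labs(src_len):
    beams = [([], 0.0)]                        # (tokens, cumulative log-prob)
    best = {"short": None, "normal": None, "long": None}
    for _ in range(MAX_LEN):
        candidates = []
        for tokens, score in beams:
            for tok, lp in enumerate(step_logprobs(tokens)):
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates:
            if tokens[-1] == EOS:              # finished: route to a length bucket
                b = bucket(len(tokens), src_len)
                if best[b] is None or score > best[b][1]:
                    best[b] = (tokens, score)
            elif len(beams) < BEAM:
                beams.append((tokens, score))
        if not beams:
            break
    return best                                # one hypothesis per length tag

print(labs(src_len=10))
```

Because every expansion step also proposes EOS, each length bucket eventually receives a finished hypothesis from the same pass, which is the property that avoids decoding three times.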
Related papers
- Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing [15.134076873312809]
A cross-lingual dubbing system translates speech from one language to another while preserving key characteristics such as duration, speaker identity, and speaking speed. We propose a discrete diffusion-based speech-to-unit translation model with explicit duration control, enabling time-aligned translation. We then synthesize speech from the predicted units and the source identity with a conditional flow matching model.
arXiv Detail & Related papers (2025-05-27T08:43:28Z)
- BLAB: Brutally Long Audio Bench [90.20616799311578]
Brutally Long Audio Bench (BLAB) is a long-form audio benchmark that evaluates audio LMs on localization, duration estimation, emotion, and counting tasks. BLAB consists of 833+ hours of diverse, full-length audio clips, each paired with human-annotated, text-based natural language questions and answers. We evaluate six open-source and proprietary audio LMs on BLAB and find that all of them, including advanced models such as Gemini 2.0 Pro and GPT-4o, struggle with the tasks.
arXiv Detail & Related papers (2025-05-05T22:28:53Z)
- Isochrony-Controlled Speech-to-Text Translation: A study on translating from Sino-Tibetan to Indo-European Languages [33.5772006275197]
End-to-end speech translation (ST) translates source language speech directly into target language text.
Many ST applications require strict length control to ensure that the translation duration matches the length of the source audio.
We present improvements to the duration alignment component of our sequence-to-sequence ST model.
arXiv Detail & Related papers (2024-11-11T21:39:21Z)
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose TransFace, a model for talking head translation that can directly translate audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
- Improving Isochronous Machine Translation with Target Factors and Auxiliary Counters [71.02335065794384]
We introduce target factors in a transformer model to predict durations jointly with target language phoneme sequences.
We show that our model improves translation quality and isochrony compared to previous work.
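As a rough illustration of the target-factor idea, here is a minimal, hypothetical PyTorch sketch in which each decoder step emits phoneme logits and, as a parallel factor, binned-duration logits; all layer names and sizes are assumptions, not the paper's architecture.

```python
# Hypothetical sketch of a factored output layer: each decoder step
# predicts a phoneme and, as a parallel target factor, a binned duration.
import torch
import torch.nn as nn

class FactoredHead(nn.Module):
    def __init__(self, d_model=512, n_phonemes=100, n_duration_bins=32):
        super().__init__()
        self.phoneme = nn.Linear(d_model, n_phonemes)     # phoneme factor
        self.duration = nn.Linear(d_model, n_duration_bins)  # duration factor

    def forward(self, h):
        # h: (batch, steps, d_model) decoder hidden states
        return self.phoneme(h), self.duration(h)

head = FactoredHead()
h = torch.randn(2, 7, 512)
ph_logits, dur_logits = head(h)
print(ph_logits.shape, dur_logits.shape)  # (2, 7, 100) and (2, 7, 32)
```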
arXiv Detail & Related papers (2023-05-22T16:36:04Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into speech in a target language.
To ensure that the translated speech is well aligned with the corresponding video, the length/duration of the translated speech should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation.
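As a rough illustration of duration-aware translation, here is a minimal, hypothetical Python sketch in which a hypothesis score is penalized by how far its accumulated per-token speech duration drifts from the source duration; the penalty weight and toy durations are assumptions, not VideoDubber's actual scoring.

```python
# Hypothetical sketch of duration-aware hypothesis scoring: the beam
# score of a translation is penalized by the relative drift between its
# accumulated per-token speech duration and the source duration.
def duration_penalty(token_durations, src_duration, weight=2.0):
    drift = abs(sum(token_durations) - src_duration) / src_duration
    return -weight * drift

def rescore(logprob, token_durations, src_duration):
    return logprob + duration_penalty(token_durations, src_duration)

# a 0.67s hypothesis against a 0.70s source is only mildly penalized
print(rescore(-3.2, [0.12, 0.30, 0.25], src_duration=0.70))
```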
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound [103.28102473127748]
We introduce an audiovisual method for long-range text-to-video retrieval.
Our approach aims to retrieve minute-long videos that capture complex human actions.
Our method is 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches.
arXiv Detail & Related papers (2022-04-06T14:43:42Z)
- Creating Speech-to-Speech Corpus from Dubbed Series [8.21384946488751]
We propose an unsupervised approach to construct a speech-to-speech corpus, aligned at the short-segment level.
Our methodology exploits video frames, speech recognition, machine translation, and noisy-frame removal algorithms to match segments in both languages.
Our pipeline was able to generate 17 hours of paired segments, which is about 47% of the corpus.
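As a rough illustration of the matching step, here is a minimal, hypothetical Python sketch that translates source-side segments and pairs each with the most similar target-side segment, discarding weak matches as noise; translate(), the similarity measure, and the threshold are stand-ins, not the paper's components.

```python
# Hypothetical sketch of segment matching: translate each source-side
# segment, pair it with the most similar target-side segment, and drop
# pairs below a similarity threshold as noise.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def match_segments(src_segments, tgt_segments, translate, threshold=0.6):
    pairs = []
    for src in src_segments:
        hyp = translate(src)                        # MT of the source segment
        best = max(tgt_segments, key=lambda t: similarity(hyp, t))
        if similarity(hyp, best) >= threshold:      # filter noisy matches
            pairs.append((src, best))
    return pairs

# toy usage with a canned "translator"
print(match_segments(["hola mundo"], ["hello world", "goodbye"],
                     translate=lambda s: "hello world"))
```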
arXiv Detail & Related papers (2022-03-07T18:52:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.