Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture
Videos into Multiple Indian Languages
- URL: http://arxiv.org/abs/2211.01338v1
- Date: Tue, 1 Nov 2022 07:06:29 GMT
- Title: Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture
Videos into Multiple Indian Languages
- Authors: Anusha Prakash, Arun Kumar, Ashish Seth, Bhagyashree Mukherjee, Ishika
Gupta, Jom Kuriakose, Jordan Fernandes, K V Vikram, Mano Ranjith Kumar M,
Metilda Sagaya Mary, Mohammad Wajahat, Mohana N, Mudit Batra, Navina K, Nihal
John George, Nithya Ravi, Pruthwik Mishra, Sudhanshu Srivastava, Vasista Sai
Lodagala, Vandan Mujadia, Kada Sai Venkata Vineeth, Vrunda Sukhadia, Dipti
Sharma, Hema Murthy, Pushpak Bhattacharya, S Umesh, Rajeev Sangal
- Abstract summary: Cross-lingual dubbing of lecture videos requires the transcription of the original audio, correction and removal of disfluencies.
This paper describes the challenges in regenerating English lecture videos in Indian languages semi-automatically.
- Score: 5.17905382659474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-lingual dubbing of lecture videos requires the transcription of the
original audio, correction and removal of disfluencies, domain term discovery,
text-to-text translation into the target language, chunking of text using
target language rhythm, text-to-speech synthesis followed by isochronous
lipsyncing to the original video. This task becomes challenging when the source
and target languages belong to different language families, resulting in
differences in generated audio duration. This is further compounded by the
original speaker's rhythm, especially for extempore speech. This paper
describes the challenges in regenerating English lecture videos in Indian
languages semi-automatically. A prototype is developed for dubbing lectures
into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two
languages, Hindi and Tamil, on two different courses. The output video is
compared with the original video in terms of MOS (1-5) and lip synchronisation
with scores of 4.09 and 3.74, respectively. Human effort is also reduced by
75%.
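The abstract enumerates the pipeline stages in order: transcription, disfluency removal, translation, rhythm-based chunking, synthesis, and isochronous lip-syncing. Below is a minimal sketch of how such stages might be chained; every function name, the filler-word list, and the 3-words-per-second speaking rate are illustrative assumptions, with each stage standing in for the actual ASR, MT, and TTS models (the paper does not publish this interface).

```python
from dataclasses import dataclass

# Toy filler list for disfluency removal; the paper's pipeline combines
# automatic correction with manual post-editing.
FILLERS = {"um", "uh", "erm", "hmm"}

@dataclass
class Chunk:
    text: str            # translated text for one synthesis unit
    src_duration: float  # duration (s) of the matching source-audio span

def remove_disfluencies(transcript: str) -> str:
    """Drop filler words from the ASR transcript (rule-based toy pass)."""
    kept = [w for w in transcript.split() if w.lower().strip(",.") not in FILLERS]
    return " ".join(kept)

def translate(text: str, target_lang: str) -> str:
    """Placeholder for text-to-text MT into the target Indian language."""
    return text  # a real system would invoke an MT model here

def chunk_text(text: str, src_durations: list[float]) -> list[Chunk]:
    """Split the translation into one chunk per source-audio span.
    The paper chunks using target-language rhythm; this naive equal-word
    split only keeps the sketch runnable."""
    words, n = text.split(), max(1, len(src_durations))
    per = max(1, len(words) // n)
    return [Chunk(" ".join(words[i * per:(i + 1) * per]), d)
            for i, d in enumerate(src_durations)]

def synthesize(chunk: Chunk) -> float:
    """Placeholder TTS returning only a duration, faked from an assumed
    3-words-per-second speaking rate."""
    return len(chunk.text.split()) / 3.0

def isochrony_factor(tts_dur: float, src_dur: float) -> float:
    """Time-scale factor that stretches or compresses the synthesized
    chunk to fit the source span, keeping the dub lip-synchronous."""
    return tts_dur / src_dur if src_dur > 0 else 1.0

def dub(transcript: str, src_durations: list[float], lang: str) -> list[float]:
    translated = translate(remove_disfluencies(transcript), lang)
    return [isochrony_factor(synthesize(c), c.src_duration)
            for c in chunk_text(translated, src_durations)]

# Two source spans of 1.5 s and 2.0 s; the returned factors say how much
# each synthesized chunk must be time-scaled to match them.
print(dub("um so today we uh discuss sorting algorithms in detail",
          [1.5, 2.0], "hi"))
```

In the full pipeline these factors would drive time-scale modification of the synthesized audio (or re-synthesis at an adjusted speaking rate) before lip-syncing; this is where the duration mismatch between language families described in the abstract has to be absorbed.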
Related papers
- MulliVC: Multi-lingual Voice Conversion With Cycle Consistency [75.59590240034261]
MulliVC is a novel voice conversion system that converts only the timbre while keeping the original content and source-language prosody, without requiring multilingual paired data.
Both objective and subjective results indicate that MulliVC significantly surpasses other methods in both monolingual and cross-lingual contexts.
arXiv Detail & Related papers (2024-08-08T18:12:51Z)
- Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding [19.544839928488972]
We construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SYMON).
M-SYMON contains 13,166 movie summary videos from 7 languages, as well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video.
Training on the human annotated data from SyMoN outperforms the SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU scores, respectively.
arXiv Detail & Related papers (2024-06-18T22:44:50Z)
- Wav2Gloss: Generating Interlinear Glossed Text from Speech [78.64412090339044]
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z)
- Direct Punjabi to English speech translation using discrete units [4.883313216485195]
We present a direct speech-to-speech translation model from Punjabi, one of the Indic languages, to English.
We also explore the performance of using a discrete representation of speech called discrete acoustic units as input to the Transformer-based translation model.
Our results show that the Unit-to-Unit Translation (U2UT) model outperforms the Speech-to-Unit Translation (S2UT) model by 3.69 BLEU points.
arXiv Detail & Related papers (2024-02-25T03:03:34Z)
- TransFace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation [54.155138561698514]
Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning.
Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors.
We propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages.
arXiv Detail & Related papers (2023-12-23T08:45:57Z)
- TRAVID: An End-to-End Video Translation Framework [1.6131714685439382]
We present an end-to-end video translation system that not only translates spoken language but also synchronizes the translated speech with the lip movements of the speaker.
Our system focuses on translating educational lectures in various Indian languages, and it is designed to be effective even in low-resource system settings.
arXiv Detail & Related papers (2023-09-20T14:13:05Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multi-lingual models trained on more data outperform monolingual ones, but when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing [73.56970726406274]
Video dubbing aims to translate the original speech in a film or television program into the speech in a target language.
To ensure the translated speech is well aligned with the corresponding video, its duration should be as close as possible to that of the original speech.
We propose a machine translation system tailored for the task of video dubbing, which directly considers the speech duration of each token in translation (see the sketch after this list).
arXiv Detail & Related papers (2022-11-30T12:09:40Z)
- Towards Automatic Speech to Sign Language Generation [35.22004819666906]
We propose a multi-language transformer network trained to generate signer's poses from speech segments.
Our model learns to generate continuous sign pose sequences in an end-to-end manner.
arXiv Detail & Related papers (2021-06-24T06:44:19Z)
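Of the related papers above, VideoDubber's speech-aware length control bears most directly on the isochrony problem in this paper. The sketch below illustrates the objective in its simplest form, rescoring a fixed n-best list of translations by estimated speech duration; the per-character duration estimate, the weight alpha, and the candidate list are all assumptions, and the actual system instead models token-level durations inside decoding rather than rescoring after the fact.

```python
# Hypothetical duration-aware rescoring: pick, among candidate translations,
# the one balancing model score against estimated speech-duration mismatch.

def estimate_duration(text: str, sec_per_char: float = 0.08) -> float:
    """Crude speech-duration estimate from character count (assumption)."""
    return len(text) * sec_per_char

def rescore(candidates: list[tuple[str, float]],
            src_duration: float,
            alpha: float = 0.5) -> str:
    """Return the candidate maximizing model score minus a duration penalty."""
    def objective(cand: tuple[str, float]) -> float:
        text, model_score = cand
        mismatch = abs(estimate_duration(text) - src_duration)
        return model_score - alpha * mismatch
    return max(candidates, key=objective)[0]

# Toy n-best list: (translation, log-probability-style model score).
nbest = [
    ("short translation", -1.2),
    ("a somewhat longer candidate translation", -1.0),
    ("a very long candidate translation with many extra words", -0.9),
]
print(rescore(nbest, src_duration=2.5))  # favours the mid-length candidate
```

A real implementation would replace the character-count heuristic with a duration predictor trained on TTS alignments.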