TRAVID: An End-to-End Video Translation Framework
- URL: http://arxiv.org/abs/2309.11338v1
- Date: Wed, 20 Sep 2023 14:13:05 GMT
- Title: TRAVID: An End-to-End Video Translation Framework
- Authors: Prottay Kumar Adhikary, Bandaru Sugandhi, Subhojit Ghimire, Santanu Pal and Partha Pakray
- Abstract summary: We present an end-to-end video translation system that not only translates spoken language but also synchronizes the translated speech with the lip movements of the speaker.
Our system focuses on translating educational lectures in various Indian languages, and it is designed to be effective even in low-resource system settings.
- Score: 1.6131714685439382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In today's globalized world, effective communication with people from diverse
linguistic backgrounds has become increasingly crucial. While traditional
methods of language translation, such as written text or voice-only
translations, can accomplish the task, they often fail to capture the complete
context and nuanced information conveyed through nonverbal cues like facial
expressions and lip movements. In this paper, we present an end-to-end video
translation system that not only translates spoken language but also
synchronizes the translated speech with the lip movements of the speaker. Our
system focuses on translating educational lectures in various Indian languages,
and it is designed to be effective even in low-resource system settings. By
incorporating lip movements that align with the target language and matching
them with the speaker's voice using voice cloning techniques, our application
offers an enhanced experience for students and users. This additional feature
creates a more immersive and realistic learning environment, ultimately making
the learning process more effective and engaging.
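The abstract outlines a cascade of familiar stages: speech recognition, machine translation, voice-cloned speech synthesis, and lip synchronization. A minimal sketch of such a pipeline, with placeholder stage functions (assumptions for illustration, not the authors' code):

```python
# Hypothetical sketch of a cascade video-translation pipeline in the spirit
# of TRAVID: ASR -> MT -> voice-cloned TTS -> lip synchronization.
# Every stage below is a placeholder, not the paper's implementation.
from dataclasses import dataclass


@dataclass
class TranslatedVideo:
    video_path: str
    audio_path: str


def transcribe(audio_path: str, src_lang: str) -> str:
    """Placeholder ASR stage: source-language audio -> text."""
    raise NotImplementedError("plug in an ASR model here")


def translate(text: str, src_lang: str, tgt_lang: str) -> str:
    """Placeholder MT stage: source-language text -> target-language text."""
    raise NotImplementedError("plug in an MT model here")


def synthesize(text: str, speaker_ref_audio: str) -> str:
    """Placeholder voice-cloning TTS stage: target-language text -> speech
    in the original speaker's voice; returns a path to the audio file."""
    raise NotImplementedError("plug in a voice-cloning TTS model here")


def lip_sync(video_path: str, dubbed_audio_path: str) -> str:
    """Placeholder lip-sync stage: re-renders the speaker's lip movements
    to match the dubbed audio; returns a path to the new video."""
    raise NotImplementedError("plug in a lip-sync model here")


def translate_video(video_path: str, audio_path: str,
                    src_lang: str, tgt_lang: str) -> TranslatedVideo:
    text = transcribe(audio_path, src_lang)
    translated = translate(text, src_lang, tgt_lang)
    dubbed = synthesize(translated, speaker_ref_audio=audio_path)
    return TranslatedVideo(lip_sync(video_path, dubbed), dubbed)
```

A cascade like this keeps each stage swappable, which is presumably what makes the approach viable in low-resource settings: whatever ASR, MT, TTS, or lip-sync model exists for a given Indian language can be plugged in.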
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
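For intuition, a minimal PyTorch sketch of TransVIP's stated "two separate encoders" idea, one for voice characteristics and one for isochrony (timing and pause structure); shapes and module choices here are assumptions, not the actual design:

```python
# Illustrative only: one encoder for speaker/voice characteristics, one for
# isochrony (timing). Not TransVIP's actual architecture.
import torch
import torch.nn as nn


class VoiceEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, mels):          # mels: (B, T, n_mels)
        _, h = self.rnn(mels)
        return h[-1]                  # (B, dim) utterance-level voice embedding


class IsochronyEncoder(nn.Module):
    """Encodes timing structure, e.g. frame-level voiced/pause flags."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(1, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, voiced_flags):  # (B, T, 1), 1 = speech, 0 = pause
        return self.proj(voiced_flags)  # (B, T, dim) frame-level timing features


# Downstream, a speech decoder would be conditioned on both streams so the
# translated speech keeps the source speaker's voice and pause structure.
voice = VoiceEncoder()(torch.randn(2, 120, 80))
timing = IsochronyEncoder()(torch.randint(0, 2, (2, 120, 1)).float())
print(voice.shape, timing.shape)  # torch.Size([2, 256]) torch.Size([2, 120, 256])
```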
- CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition [5.520654376217889]
CLARA minimizes reliance on labelled data, enhancing generalization across languages.
Our approach adeptly captures emotional nuances in speech, overcoming subjective assessment issues.
It adapts to low-resource languages, marking progress in multilingual speech representation learning.
arXiv Detail & Related papers (2023-10-18T09:31:56Z)
- Enhancing expressivity transfer in textless speech-to-speech translation [0.0]
Existing state-of-the-art systems fall short when it comes to capturing and transferring expressivity accurately across different languages.
This study presents a novel method that operates at the discrete speech unit level and leverages multilingual emotion embeddings.
We demonstrate how these embeddings can be used to effectively predict the pitch and duration of speech units in the target language.
arXiv Detail & Related papers (2023-10-11T08:07:22Z)
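A rough sketch of the mechanism the expressivity-transfer summary describes: per-unit pitch and duration predicted from discrete speech units conditioned on an emotion embedding (module choices are assumptions, not the paper's code):

```python
# Sketch: predict per-unit prosody (pitch, duration) from discrete speech
# units plus a multilingual emotion embedding. Illustrative assumptions only.
import torch
import torch.nn as nn


class UnitProsodyPredictor(nn.Module):
    def __init__(self, n_units=1000, unit_dim=128, emo_dim=64):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, unit_dim)
        self.backbone = nn.GRU(unit_dim + emo_dim, 128, batch_first=True)
        self.pitch_head = nn.Linear(128, 1)  # per-unit F0 (e.g., log-Hz)
        self.dur_head = nn.Linear(128, 1)    # per-unit duration (e.g., log-frames)

    def forward(self, units, emotion):       # units: (B, T), emotion: (B, emo_dim)
        x = self.unit_emb(units)                              # (B, T, unit_dim)
        e = emotion.unsqueeze(1).expand(-1, x.size(1), -1)    # broadcast over time
        h, _ = self.backbone(torch.cat([x, e], dim=-1))
        return self.pitch_head(h).squeeze(-1), self.dur_head(h).squeeze(-1)


model = UnitProsodyPredictor()
pitch, dur = model(torch.randint(0, 1000, (2, 50)), torch.randn(2, 64))
print(pitch.shape, dur.shape)  # torch.Size([2, 50]) torch.Size([2, 50])
```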
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
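Treating speech units as pseudo-text reduces the task to ordinary multilingual sequence-to-sequence modeling. An illustrative sketch under that reading (not the UTUT implementation), with language-ID tokens prepended as in multilingual MT:

```python
# Sketch: discrete speech units as pseudo-text, translated many-to-many with
# a standard seq2seq Transformer and language-ID tokens. Assumptions only.
import torch
import torch.nn as nn

N_UNITS, N_LANGS, DIM = 1000, 8, 256


class UnitToUnitModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Language-ID tokens share the embedding table with the units.
        self.emb = nn.Embedding(N_UNITS + N_LANGS, DIM)
        self.seq2seq = nn.Transformer(d_model=DIM, nhead=4,
                                      num_encoder_layers=2, num_decoder_layers=2,
                                      batch_first=True)
        self.out = nn.Linear(DIM, N_UNITS)

    def forward(self, src_units, tgt_units, src_lang, tgt_lang):
        # Prepend a language-ID token to each side, as in multilingual MT.
        src = torch.cat([N_UNITS + src_lang, src_units], dim=1)
        tgt = torch.cat([N_UNITS + tgt_lang, tgt_units], dim=1)
        h = self.seq2seq(self.emb(src), self.emb(tgt))
        return self.out(h)            # next-unit logits, (B, T+1, N_UNITS)


model = UnitToUnitModel()
logits = model(torch.randint(0, N_UNITS, (2, 40)),
               torch.randint(0, N_UNITS, (2, 35)),
               src_lang=torch.full((2, 1), 0),
               tgt_lang=torch.full((2, 1), 1))
print(logits.shape)  # torch.Size([2, 36, 1000])
```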
- EC^2: Emergent Communication for Embodied Control [72.99894347257268]
Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC2 is shown to consistently outperform previous contrastive learning methods when both videos and texts are used as task inputs.
arXiv Detail & Related papers (2023-04-19T06:36:02Z)
- MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition [51.412413996510814]
We propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks.
MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2.
arXiv Detail & Related papers (2023-03-09T14:58:29Z)
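One plausible reading of MixSpeech's audio-visual stream mixup (a hedged sketch, not the paper's actual recipe): blend aligned audio and visual feature streams at a random Beta-sampled ratio, so audio supervision regularizes the visual branch:

```python
# Sketch of a generic audio-visual feature mixup; one reading of the
# abstract, not MixSpeech's published method.
import torch


def stream_mixup(audio_feats, visual_feats, alpha=0.2):
    """audio_feats, visual_feats: time-aligned (B, T, D) feature streams."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * audio_feats + (1.0 - lam) * visual_feats, lam


audio = torch.randn(4, 100, 256)   # e.g., features from an audio encoder
visual = torch.randn(4, 100, 256)  # e.g., features from a lip-reading encoder
mixed, lam = stream_mixup(audio, visual)
# `mixed` would be fed to a shared decoder with the same transcript target,
# encouraging representations that work for both modalities.
print(mixed.shape, float(lam))
```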
- Talking Face Generation with Multilingual TTS [0.8229645116651871]
We propose a system combining a talking face generation system with a text-to-speech system.
Our system can synthesize natural multilingual speech while maintaining the vocal identity of the speaker.
For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber.
arXiv Detail & Related papers (2022-05-13T02:08:35Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows the model to extend to a variety of languages with little or no fine-tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
- CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [11.552745999302905]
More than half of the 7,000 languages in the world are in imminent danger of going extinct.
It is relatively easy to obtain textual translations corresponding to speech.
We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
arXiv Detail & Related papers (2020-06-04T12:21:48Z)
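The two ingredients the CSTNet summary names, a convolutional audio encoder and a contrastive objective tying speech to its textual translation, might be sketched as follows (architecture details are assumptions):

```python
# Sketch: a small convolutional audio encoder plus an InfoNCE-style loss that
# matches each speech clip to its textual translation. Not CSTNet's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvAudioEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )

    def forward(self, mels):                 # (B, T, n_mels)
        h = self.conv(mels.transpose(1, 2))  # (B, dim, T/4)
        return h.mean(dim=-1)                # (B, dim) utterance embedding


def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE over a batch: clip i should match translation i."""
    logits = F.normalize(speech_emb) @ F.normalize(text_emb).T / temperature
    targets = torch.arange(len(logits))
    return F.cross_entropy(logits, targets)


speech = ConvAudioEncoder()(torch.randn(8, 200, 80))
text = torch.randn(8, 256)  # stand-in for embeddings of the text translations
print(contrastive_loss(speech, text))
```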
- That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages [72.9927937955371]
We use the resources existing in other languages to train a multilingual automatic speech recognition model.
We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting.
Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages.
arXiv Detail & Related papers (2020-05-16T22:28:09Z)