MeetDot: Videoconferencing with Live Translation Captions
- URL: http://arxiv.org/abs/2109.09577v1
- Date: Mon, 20 Sep 2021 14:34:14 GMT
- Title: MeetDot: Videoconferencing with Live Translation Captions
- Authors: Arkady Arkhangorodsky, Christopher Chu, Scot Fang, Yiqi Huang, Denglin
Jiang, Ajay Nagesh, Boliang Zhang, Kevin Knight
- Abstract summary: We present MeetDot, a videoconferencing system with live translation captions overlaid on screen.
Our system supports speech and captions in 4 languages and combines automatic speech recognition (ASR) and machine translation (MT) in a cascade.
We implement several features to enhance the user experience and reduce users' cognitive load, such as smooth scrolling captions and reduced caption flicker.
- Score: 18.60812558978417
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present MeetDot, a videoconferencing system with live translation captions
overlaid on screen. The system aims to facilitate conversation between people
who speak different languages, thereby reducing communication barriers between
multilingual participants. Currently, our system supports speech and captions
in 4 languages and combines automatic speech recognition (ASR) and machine
translation (MT) in a cascade. We use the re-translation strategy to translate
the streamed speech, resulting in caption flicker. Additionally, our system has
very strict latency requirements to have acceptable call quality. We implement
several features to enhance the user experience and reduce users' cognitive load,
such as smooth scrolling captions and reduced caption flicker. The modular
architecture allows us to integrate different ASR and MT services in our
backend. Our system provides an integrated evaluation suite to optimize key
intrinsic evaluation metrics such as accuracy, latency and erasure. Finally, we
present an innovative cross-lingual word-guessing game as an extrinsic
evaluation metric to measure end-to-end system performance. We plan to make our
system open-source for research purposes.
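To make the cascade concrete, here is a minimal sketch of how a re-translation captioner might chain a streaming ASR backend into an MT backend and track erasure, a proxy for the caption flicker discussed above. The class names, service interfaces, and word-level erasure definition are illustrative assumptions, not MeetDot's actual API; the abstract base classes simply stand in for the pluggable ASR/MT services enabled by the modular architecture.

```python
# Sketch of a cascaded ASR -> MT re-translation loop with a simple erasure counter.
# All names and the erasure definition are assumptions for illustration only.
from abc import ABC, abstractmethod
from typing import List


class ASRService(ABC):
    """Streaming recognizer: returns the transcript prefix heard so far."""
    @abstractmethod
    def transcribe(self, audio_chunk: bytes) -> str: ...


class MTService(ABC):
    """Text translator for a fixed source/target language pair."""
    @abstractmethod
    def translate(self, text: str) -> str: ...


def erasure(previous: List[str], current: List[str]) -> int:
    """Count words of the previous caption that get rewritten (a flicker proxy).

    Words after the longest common prefix of the old and new captions are
    treated as erased.
    """
    common = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        common += 1
    return len(previous) - common


class ReTranslationCaptioner:
    """Re-translate the full ASR prefix on every audio update (re-translation strategy)."""

    def __init__(self, asr: ASRService, mt: MTService):
        self.asr = asr
        self.mt = mt
        self.caption: List[str] = []
        self.total_erasure = 0

    def on_audio(self, audio_chunk: bytes) -> str:
        source_prefix = self.asr.transcribe(audio_chunk)        # growing source transcript
        new_caption = self.mt.translate(source_prefix).split()  # translate from scratch
        self.total_erasure += erasure(self.caption, new_caption)
        self.caption = new_caption
        return " ".join(new_caption)                            # text rendered as the caption
```

Keeping total erasure low (fewer rewritten words) while also keeping latency low is the trade-off the evaluation suite is meant to measure; features such as smooth scrolling and flicker reduction would presumably operate on the output of a loop like this.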
Related papers
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set a new state of the art in multilingual VSR by achieving performance comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming speech translation (ST) and speaker diarization (SD) solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) within a pre-training and fine-tuning framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Integrating Knowledge in End-to-End Automatic Speech Recognition for Mandarin-English Code-Switching [41.88097793717185]
Code-Switching (CS) is a common linguistic phenomenon in multilingual communities.
This paper presents our investigations on end-to-end speech recognition for Mandarin-English CS speech.
arXiv Detail & Related papers (2021-12-19T17:31:15Z)
- An Adversarial Learning based Multi-Step Spoken Language Understanding System through Human-Computer Interaction [70.25183730482915]
We introduce a novel multi-step spoken language understanding system based on adversarial learning.
We demonstrate that the new system can improve parsing performance by at least 2.5% in terms of F1.
arXiv Detail & Related papers (2021-06-06T03:46:53Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Dynamic Masking for Improved Stability in Spoken Language Translation [8.591381243212712]
A possible solution is to add a fixed delay, or "mask", to the output of the MT system.
We show how this mask can be set dynamically, improving the latency-flicker trade-off without sacrificing translation quality.
arXiv Detail & Related papers (2020-05-30T12:23:10Z)
- Towards Automatic Face-to-Face Translation [30.841020484914527]
"Face-to-Face Translation" can translate a video of a person speaking in language A into a target language B with realistic lip synchronization.
We build a working speech-to-speech translation system by bringing together multiple existing modules from speech and language.
We then move towards "Face-to-Face Translation" by incorporating a novel visual module, LipGAN for generating realistic talking faces from the translated audio.
arXiv Detail & Related papers (2020-03-01T06:42:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.