MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup
for Visual Speech Translation and Recognition
- URL: http://arxiv.org/abs/2303.05309v1
- Date: Thu, 9 Mar 2023 14:58:29 GMT
- Title: MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup
for Visual Speech Translation and Recognition
- Authors: Xize Cheng, Linjun Li, Tao Jin, Rongjie Huang, Wang Lin, Zehan Wang,
Huangdai Liu, Ye Wang, Aoxiong Yin, Zhou Zhao
- Abstract summary: We propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks.
MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-media communications facilitate global interaction among people.
However, despite researchers exploring cross-lingual translation techniques
such as machine translation and audio speech translation to overcome language
barriers, there is still a shortage of cross-lingual studies on visual speech.
This lack of research is mainly due to the absence of datasets containing
visual speech and translated text pairs. In this paper, we present
\textbf{AVMuST-TED}, the first dataset for \textbf{A}udio-\textbf{V}isual
\textbf{Mu}ltilingual \textbf{S}peech \textbf{T}ranslation, derived from
\textbf{TED} talks. Nonetheless, visual speech is not as distinguishable as
audio speech, making it difficult to develop a mapping from source speech
phonemes to the target language text. To address this issue, we propose
MixSpeech, a cross-modality self-learning framework that utilizes audio speech
to regularize the training of visual speech tasks. To further minimize the
cross-modality gap and its impact on knowledge transfer, we suggest adopting
mixed speech, which is created by interpolating audio and visual streams, along
with a curriculum learning strategy to adjust the mixing ratio as needed.
MixSpeech enhances speech translation in noisy environments, improving BLEU
scores for four languages on AVMuST-TED by +1.4 to +4.2. Moreover, it achieves
state-of-the-art performance in lip reading on CMLR (11.1\%), LRS2 (25.5\%),
and LRS3 (28.0\%).
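The core mechanism described above, interpolating the audio and visual feature streams and annealing the mixing ratio with a curriculum schedule, can be sketched as follows. This is a minimal illustration under assumed conditions (frame-aligned audio and visual features of equal dimensionality, a linear annealing schedule); the function names and the schedule are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def mix_streams(audio_feats: torch.Tensor,
                video_feats: torch.Tensor,
                audio_ratio: float) -> torch.Tensor:
    """Linearly interpolate frame-aligned audio and visual features.

    Both tensors are assumed to have shape (batch, frames, dim);
    `audio_ratio` is the fraction contributed by the audio stream.
    """
    return audio_ratio * audio_feats + (1.0 - audio_ratio) * video_feats

def curriculum_audio_ratio(step: int, total_steps: int,
                           start: float = 0.9, end: float = 0.1) -> float:
    """Anneal the audio share from `start` down to `end`.

    A linear schedule is assumed for illustration: early training leans on
    the easier audio modality, later training relies mostly on the visual
    stream.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return start + progress * (end - start)

# Example: build the mixed stream for one training step.
audio = torch.randn(4, 100, 512)   # (batch, frames, dim) audio features
video = torch.randn(4, 100, 512)   # frame-aligned visual features
ratio = curriculum_audio_ratio(step=1000, total_steps=10000)
mixed = mix_streams(audio, video, ratio)  # consumed by the shared encoder
```

In this sketch the mixed stream would be fed to the same encoder as the pure visual stream, so the audio modality acts as a regularizer for the visual speech task, in line with the cross-modality self-learning idea described in the abstract.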
Related papers
- SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought [12.54786997634534]
This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST.
We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting.
The model is first guided to translate the target semantic content and then to transfer the speaker style to multi-stream acoustic units.
(arXiv, 2024-05-30)
- XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception [62.660135152900615]
Speech recognition and translation systems perform poorly on noisy inputs.
XLAVS-R is a cross-lingual audio-visual speech representation model for noise-robust speech recognition and translation.
(arXiv, 2024-03-21)
- Toward Joint Language Modeling for Speech Units and Text [89.32163954508489]
We explore joint language modeling for speech units and text.
We introduce automatic metrics to evaluate how well the joint LM mixes speech and text.
Our results show that by mixing speech units and text with our proposed mixing techniques, the joint LM improves over a speech-only baseline on spoken language understanding (SLU) tasks.
(arXiv, 2023-10-12)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
(arXiv, 2023-06-22)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
(arXiv, 2023-05-24)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB.
(arXiv, 2022-09-30)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the system on real-world S2ST data.
(arXiv, 2021-12-15)
- LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
This work investigates the impact of cross-modal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture with a location-aware attention mechanism that maps face image sequences to mel-scale spectrograms.
(arXiv, 2021-12-09)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.