VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning
- URL: http://arxiv.org/abs/2211.11275v2
- Date: Fri, 19 May 2023 10:03:56 GMT
- Title: VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for
Speech Representation Learning
- Authors: Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie
Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei
- Abstract summary: We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
- Score: 119.49605266839053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although speech is a simple and effective way for humans to communicate with
the outside world, a more realistic speech interaction contains multimodal
information, e.g., vision and text. How to design a unified framework to integrate
different modal information and leverage different resources (e.g.,
visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to
facilitate speech representation learning has not been well explored. In this paper,
we propose a unified cross-modal representation learning framework VATLM
(Visual-Audio-Text Language Model). The proposed VATLM employs a unified
backbone network to model the modality-independent information and utilizes
three simple modality-dependent modules to preprocess visual, speech, and text
inputs. In order to integrate these three modalities into one shared semantic
space, VATLM is optimized with a masked prediction task of unified tokens,
given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on
audio-visual downstream tasks, including audio-visual speech recognition (AVSR)
and visual speech recognition (VSR). Results show that the proposed VATLM
outperforms previous state-of-the-art models, such as the audio-visual
pre-trained AV-HuBERT model, and analysis also demonstrates that
VATLM is capable of aligning different modalities into the same space. To
facilitate future research, we release the code and pre-trained models at
https://aka.ms/vatlm.
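As a concrete illustration of the pre-training objective described above (modality-dependent frontends feeding a shared, modality-independent backbone trained to predict unified tokens at masked positions), here is a minimal PyTorch sketch. The module names, feature dimensions, and unified vocabulary size are assumptions made for illustration; this is not the released VATLM implementation, which is available at https://aka.ms/vatlm.

import torch
import torch.nn as nn

class UnifiedMaskedPredictor(nn.Module):
    """Illustrative sketch only; shapes and frontends are assumed, not VATLM's."""
    def __init__(self, dim=768, unified_vocab=2000, layers=6, heads=12):
        super().__init__()
        # Simple modality-dependent modules map raw inputs to a common width.
        self.video_frontend = nn.Linear(512, dim)      # assumed lip-ROI features
        self.audio_frontend = nn.Linear(80, dim)       # assumed log-mel frames
        self.text_frontend = nn.Embedding(10000, dim)  # assumed subword/phoneme ids
        self.mask_emb = nn.Parameter(torch.zeros(dim)) # learned mask embedding
        # Unified, modality-independent Transformer backbone.
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(enc, num_layers=layers)
        # Prediction head over the unified-token vocabulary.
        self.head = nn.Linear(dim, unified_vocab)

    def forward(self, feats, modality, mask):
        frontend = {"video": self.video_frontend,
                    "audio": self.audio_frontend,
                    "text": self.text_frontend}[modality]
        x = frontend(feats)                                            # (B, T, dim)
        # Corrupt masked positions with the mask embedding before encoding.
        x = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        return self.head(self.backbone(x))                             # (B, T, unified_vocab)

def masked_prediction_loss(model, feats, unified_targets, modality, mask):
    # Cross-entropy only at masked positions, against unified-token labels.
    logits = model(feats, modality, mask)
    return nn.functional.cross_entropy(logits[mask], unified_targets[mask])

In the actual framework the unified targets would come from the proposed unified tokenizer, which maps speech, visual, and text inputs into one shared discrete vocabulary; in this sketch they are simply assumed to be precomputed integer labels of shape (B, T), with mask a boolean tensor marking the positions to predict.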
Related papers
- Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization [4.062872727927056]
The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question.
Existing methods either focus solely on visual modality or integrate visual and subtitle modalities.
We propose a unified Audio-Visual-Textual Span localization (AVTSL) method that incorporates audio modality to augment both visual and textual representations.
arXiv Detail & Related papers (2024-11-05T06:49:14Z)
- CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection [2.110168344647122]
Voice Activity Detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech.
We introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models.
Our approach outperforms several audio-visual methods despite its simplicity, and without requiring pre-training on extensive audio-visual datasets.
arXiv Detail & Related papers (2024-10-18T14:43:34Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy that processes inputs as visual speech units.
We set new state-of-the-art multilingual VSR performance, achieving results comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)