VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and
Dataset
- URL: http://arxiv.org/abs/2304.08345v1
- Date: Mon, 17 Apr 2023 15:08:15 GMT
- Title: VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and
Dataset
- Authors: Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang,
Jinhui Tang, Jing Liu
- Abstract summary: We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
- Score: 53.46019570679092
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining
model (VALOR) for multi-modal understanding and generation. Different from
widely-studied vision-language pretraining models, VALOR jointly models
relationships of vision, audio and language in an end-to-end manner. It
contains three separate encoders for single modality representations, and a
decoder for multimodal conditional text generation. We design two pretext tasks
to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and
Multimodal Grouping Captioning (MGC). MGA projects vision, language and audio
to the same common space, building vision-language, audio-language and
audiovisual-language alignment simultaneously. MGC learns how to generate text
tokens conditioned on vision, audio, or both. To promote
vision-audio-language pretraining research, we construct a large-scale
high-quality tri-modality dataset named VALOR-1M, which contains 1M audible
videos with human-annotated audiovisual captions. Extensive experiments show
that VALOR can learn strong multimodal correlations and be generalized to
various downstream tasks (e.g., retrieval, captioning and question answering),
with different input modalities (e.g., vision-language, audio-language and
audiovisual-language). VALOR achieves new state-of-the-art performance on
a series of public cross-modality benchmarks. Code and data are available at
the project page https://casia-iva-group.github.io/projects/VALOR.
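The abstract describes the architecture and pretext tasks only at a high level. As a rough, non-authoritative illustration, the PyTorch-style sketch below shows one common way an MGA-like alignment loss and an MGC-like captioning loss could be combined; the function names, the simple additive audiovisual fusion, and all shapes are assumptions for illustration and do not come from the released VALOR code.

```python
# Hedged sketch of MGA-style alignment and MGC-style captioning losses.
# Assumes pre-extracted per-modality embeddings; all names/shapes are illustrative.
import torch
import torch.nn.functional as F


def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def mga_loss(v, a, t):
    """MGA-style alignment (sketch): align text with vision, audio, and a
    fused audiovisual group in one shared embedding space."""
    av = F.normalize(v, dim=-1) + F.normalize(a, dim=-1)  # naive fusion; the paper's grouping may differ
    return (contrastive_loss(v, t) +
            contrastive_loss(a, t) +
            contrastive_loss(av, t)) / 3.0


def mgc_loss(decoder_logits, caption_ids, pad_id=0):
    """MGC-style captioning (sketch): cross-entropy over caption tokens; the
    conditioning on vision, audio, or both happens inside whatever decoder
    produced `decoder_logits`."""
    return F.cross_entropy(decoder_logits.flatten(0, 1),
                           caption_ids.flatten(),
                           ignore_index=pad_id)


if __name__ == "__main__":
    B, D, L, V = 4, 256, 12, 1000             # batch, embed dim, caption length, vocab size
    vision, audio, text = (torch.randn(B, D) for _ in range(3))
    logits = torch.randn(B, L, V)
    captions = torch.randint(1, V, (B, L))
    total = mga_loss(vision, audio, text) + mgc_loss(logits, captions)
    print(f"toy pretraining loss: {total.item():.3f}")
```

Read this way, MGA reduces to symmetric contrastive alignment of each modality group with text, and MGC reduces to standard teacher-forced cross-entropy over caption tokens produced by a decoder conditioned on whichever modalities are present.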
Related papers
- Unified Video-Language Pre-training with Synchronized Audio [21.607860535968356]
We propose an enhanced framework for Video-Language pre-training with Synchronized Audio.
Our framework learns tri-modal representations in a unified self-supervised transformer.
Our model, pre-trained on only 0.9M data, achieves improved results over state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-12T07:59:46Z)
- Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation [55.15299351110525]
This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model.
We propose a novel training strategy, processing with visual speech units.
We set a new state of the art in multilingual VSR, achieving performance comparable to previous language-specific VSR models.
arXiv Detail & Related papers (2024-01-18T08:46:02Z)
- i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data [101.52821120195975]
i-Code V2 is the first model capable of generating natural language from any combination of Vision, Language, and Speech data.
The system is pretrained end-to-end on a large collection of dual- and single-modality datasets.
arXiv Detail & Related papers (2023-05-21T01:25:44Z)
- Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and the other modalities (a generic sketch of these objectives appears after this list).
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z)
- OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset [14.619865864254924]
Open Large-scale Korean Audio-Visual Speech (OLKAVS) dataset is the largest among publicly available audio-visual speech datasets.
The dataset contains 1,150 hours of transcribed audio from 1,107 Korean speakers in a studio setup with nine different viewpoints and various noise situations.
arXiv Detail & Related papers (2023-01-16T11:40:50Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z)
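Several of the entries above (CLIP4VLA and OmniVL in particular) mention contrastive objectives across modalities. As a generic, hedged sketch of the inter-modal and intra-modal contrastive terms referenced in the CLIP4VLA summary, the snippet below pairs audio with video and text (inter-modal) and with an augmented view of itself (intra-modal); all names, shapes, and the equal weighting are illustrative assumptions, not the papers' actual implementations.

```python
# Minimal sketch of inter-modal vs. intra-modal contrastive terms.
import torch
import torch.nn.functional as F


def info_nce(q, k, temperature=0.07):
    """Standard InfoNCE: matching batch indices are treated as positives."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.t() / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)


def audio_contrastive_losses(audio, audio_aug, video, text):
    # Inter-modal terms: pull each audio clip toward its paired video and caption.
    inter = info_nce(audio, video) + info_nce(audio, text)
    # Intra-modal term: pull two augmented views of the same audio clip together.
    intra = info_nce(audio, audio_aug)
    return inter + intra                      # equal weighting is an assumption


if __name__ == "__main__":
    B, D = 8, 512
    a, a2, v, t = (torch.randn(B, D) for _ in range(4))
    print(f"toy contrastive loss: {audio_contrastive_losses(a, a2, v, t).item():.3f}")
```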