VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and
Dataset
- URL: http://arxiv.org/abs/2305.18500v2
- Date: Sat, 7 Oct 2023 12:58:26 GMT
- Title: VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and
Dataset
- Authors: Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin
Zhu, Jing Liu
- Abstract summary: We explore an automatically generated large-scale omni-modality video caption dataset called VAST-27M.
We first collect 27 million open-domain video clips and separately train a vision and an audio captioner to generate vision and audio captions.
We employ an off-the-shelf Large Language Model (LLM) to integrate the generated captions, together with subtitles and instructional prompts, into omni-modality captions.
- Score: 17.927825332032477
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision and text have been fully explored in contemporary video-text
foundational models, while other modalities such as audio and subtitles in
videos have not received sufficient attention. In this paper, we establish
connections between multi-modality video tracks, including Vision,
Audio, and Subtitle, and Text by exploring an automatically generated
large-scale omni-modality video caption dataset called VAST-27M. Specifically,
we first collect 27 million open-domain video clips and separately train a
vision and an audio captioner to generate vision and audio captions. Then, we
employ an off-the-shelf Large Language Model (LLM) to integrate the generated
captions, together with subtitles and instructional prompts, into omni-modality
captions. Based on the proposed VAST-27M dataset, we train an omni-modality
video-text foundational model named VAST, which can perceive and process
vision, audio, and subtitle modalities from video, and better support various
tasks including vision-text, audio-text, and multi-modal video-text tasks
(retrieval, captioning and QA). Extensive experiments have been conducted to
demonstrate the effectiveness of our proposed VAST-27M corpus and VAST
foundation model. VAST achieves 22 new state-of-the-art results on various
cross-modality benchmarks. Code, model and dataset will be released at
https://github.com/TXH-mercury/VAST.
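
The caption-fusion step described in the abstract lends itself to a small illustration. The sketch below shows, under stated assumptions, how an off-the-shelf LLM might be prompted to merge a clip's vision caption, audio caption, and subtitle into a single omni-modality caption. The prompt wording and the names `ClipCaptions`, `build_fusion_prompt`, and `call_llm` are illustrative assumptions, not the authors' released VAST-27M code.

```python
# Minimal sketch of the VAST-27M-style caption fusion step (illustrative only).
# A vision caption, an audio caption, and the clip's subtitle are packed into an
# instructional prompt for an off-the-shelf LLM, which would return one
# omni-modality caption.

from dataclasses import dataclass


@dataclass
class ClipCaptions:
    vision_caption: str  # produced by a trained vision captioner
    audio_caption: str   # produced by a trained audio captioner
    subtitle: str        # subtitle / ASR track of the clip


def build_fusion_prompt(c: ClipCaptions) -> str:
    """Assemble the instructional prompt asking the LLM to merge all cues."""
    return (
        "Combine the following descriptions of one video clip into a single "
        "fluent caption that covers the visual scene, the sounds, and what is "
        "said, without inventing details.\n"
        f"Vision caption: {c.vision_caption}\n"
        f"Audio caption: {c.audio_caption}\n"
        f"Subtitle: {c.subtitle}\n"
        "Omni-modality caption:"
    )


def call_llm(prompt: str) -> str:
    """Placeholder for any off-the-shelf LLM endpoint; swap in a real client."""
    raise NotImplementedError("Plug in your preferred LLM API here.")


if __name__ == "__main__":
    clip = ClipCaptions(
        vision_caption="A man chops vegetables on a wooden board in a kitchen.",
        audio_caption="Rhythmic knife tapping with soft background music.",
        subtitle="Today I'll show you how to prep a quick stir-fry.",
    )
    # Print the assembled prompt; the fused caption would come from call_llm(...).
    print(build_fusion_prompt(clip))
```

Running the script only prints the assembled prompt; any LLM client can be substituted for `call_llm` to produce the fused caption.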
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework that produces annotations for multiple modalities over more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that takes longer subtitle passages into account, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establish an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z)
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)