MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and
GENeration
- URL: http://arxiv.org/abs/2204.08058v2
- Date: Wed, 20 Apr 2022 19:32:57 GMT
- Title: MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and
GENeration
- Authors: Thomas Hayes, Songyang Zhang, Xi Yin, Guan Pang, Sasha Sheng, Harry
Yang, Songwei Ge, Qiyuan Hu, and Devi Parikh
- Abstract summary: Multimodal video-audio-text understanding and generation can benefit from datasets that are narrow but rich.
We present a large-scale video-audio-text dataset MUGEN, collected using the open-sourced platform game CoinRun.
We sample 375K video clips (3.2s each) and collect text descriptions from human annotators.
- Score: 46.19536568693307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal video-audio-text understanding and generation can benefit from
datasets that are narrow but rich. The narrowness allows bite-sized challenges
that the research community can make progress on. The richness ensures we are
making progress along the core challenges. To this end, we present a
large-scale video-audio-text dataset MUGEN, collected using the open-sourced
platform game CoinRun [11]. We made substantial modifications to make the game
richer by introducing audio and enabling new interactions. We trained RL agents
with different objectives to navigate the game and interact with 13 objects and
characters. This allows us to automatically extract a large collection of
diverse videos and associated audio. We sample 375K video clips (3.2s each) and
collect text descriptions from human annotators. Each video has additional
annotations that are extracted automatically from the game engine, such as
accurate semantic maps for each frame and templated textual descriptions.
Altogether, MUGEN can help progress research in many tasks in multimodal
understanding and generation. We benchmark representative approaches on tasks
involving video-audio-text retrieval and generation. Our dataset and code are
released at: https://mugen-org.github.io/.
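To make the data layout and the retrieval task concrete, below is a minimal, hypothetical Python sketch of a MUGEN-style clip record and a toy text-to-video retrieval ranking. All class, field, and function names here are illustrative assumptions rather than the released MUGEN API, and the word-overlap scorer is only a stand-in for the learned joint-embedding models benchmarked in the paper; see https://mugen-org.github.io/ for the actual dataset format and code.

```python
# Illustrative sketch only (not the official MUGEN loader): one record per
# 3.2 s clip, bundling video, audio, human and templated captions, and
# per-frame semantic maps, plus a toy text-to-video retrieval ranking.
from dataclasses import dataclass, field
from typing import List


@dataclass
class MugenClip:
    clip_id: str
    video_path: str                # 3.2 s gameplay clip rendered from CoinRun
    audio_path: str                # game audio aligned with the clip
    human_caption: str             # free-form description from a human annotator
    template_caption: str          # templated description from the game engine
    semantic_map_paths: List[str] = field(default_factory=list)  # per-frame maps


def text_to_video_retrieval(query: str, clips: List[MugenClip]) -> List[MugenClip]:
    """Rank clips for a text query by caption word overlap.

    The real benchmarks use learned video-audio-text embeddings; this toy
    scorer only illustrates the input/output shape of the retrieval task.
    """
    query_words = set(query.lower().split())

    def overlap(clip: MugenClip) -> int:
        return len(query_words & set(clip.human_caption.lower().split()))

    return sorted(clips, key=overlap, reverse=True)


if __name__ == "__main__":
    clips = [
        MugenClip("000001", "v/000001.mp4", "a/000001.wav",
                  "Mugen jumps over a gem and collects a coin",
                  "jump; collect coin"),
        MugenClip("000002", "v/000002.mp4", "a/000002.wav",
                  "Mugen is killed by a bee while walking right",
                  "walk right; killed by bee"),
    ]
    print(text_to_video_retrieval("collect a coin", clips)[0].clip_id)  # 000001
```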
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Read, Watch and Scream! Sound Generation from Text and Video [23.990569918960315]
We propose a novel video-and-text-to-sound generation method called ReWaS.
Our method estimates the structural information of audio from the video while receiving key content cues from a user prompt.
By separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences.
arXiv Detail & Related papers (2024-07-08T01:59:17Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.
We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass along the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.