3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social
Media Short Videos
- URL: http://arxiv.org/abs/2203.14456v1
- Date: Mon, 28 Mar 2022 02:47:01 GMT
- Title: 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social
Media Short Videos
- Authors: Vikram Gupta, Trisha Mittal, Puneet Mathur, Vaibhav Mishra, Mayank
Maheshwari, Aniket Bera, Debdoot Mukherjee, Dinesh Manocha
- Abstract summary: We present 3MASSIV, a multilingual, multimodal and multi-aspect, expertly-annotated dataset of diverse short videos extracted from the short-video social media platform Moj.
3MASSIV comprises 50K short videos (20 seconds average duration) and 100K unlabeled videos in 11 different languages.
We show how the social media content in 3MASSIV is dynamic and temporal in nature, and can be used for semantic understanding tasks and cross-lingual analysis.
- Score: 72.69052180249598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present 3MASSIV, a multilingual, multimodal and multi-aspect,
expertly-annotated dataset of diverse short videos extracted from the
short-video social media platform Moj. 3MASSIV comprises 50K short videos
(20 seconds average duration) and 100K unlabeled videos in 11 different
languages, and captures popular short-video trends like pranks, fails,
romance, and comedy, expressed via unique audio-visual formats like self-shot
videos, reaction videos, lip-synching, self-sung songs, etc. 3MASSIV presents
an opportunity for multimodal and multilingual semantic understanding of these
unique videos by annotating them for concepts, affective states, media types,
and audio language. We present a thorough analysis of 3MASSIV and highlight
the variety and unique aspects of our dataset compared to other contemporary
popular datasets, along with strong baselines. We also show how the social
media content in 3MASSIV is dynamic and temporal in nature, and can be used
for semantic understanding tasks and cross-lingual analysis.
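To picture the multi-aspect annotation concretely, each labeled video can be thought of as one record over the four annotated aspects (concepts, affective states, media types, audio language). A minimal sketch in Python, with hypothetical field names; the actual release defines its own schema:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MassivVideo:
    # Field names are illustrative assumptions, not the official 3MASSIV schema.
    video_id: str
    duration_sec: float            # ~20 seconds on average per the abstract
    concepts: List[str]            # e.g. ["prank", "comedy"]
    affective_states: List[str]    # e.g. ["amusement"]
    media_types: List[str]         # e.g. ["self-shot video", "reaction video"]
    audio_language: Optional[str]  # one of the 11 languages; None for the 100K unlabeled videos

# A hypothetical labeled record; values are invented for illustration.
labeled = MassivVideo(
    video_id="moj_000042",
    duration_sec=19.5,
    concepts=["prank"],
    affective_states=["amusement"],
    media_types=["self-shot video"],
    audio_language="Hindi",
)
```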
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal-language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Vript: A Video Is Worth Thousands of Words [54.815686588378156]
Vript is an annotated corpus of 12K high-resolution videos, offering detailed, dense, and script-like captions for over 420K clips.
Each clip has a caption of about 145 words, which is over 10x longer than most video-text datasets.
Based on Vript, we train Vriptor, a model capable of end-to-end generation of dense and detailed captions for long videos.
arXiv Detail & Related papers (2024-06-10T06:17:55Z)
- M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset [26.339836754484082]
We propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV).
M$^3$AV has almost 367 hours of videos from five sources, covering computer science, mathematics, medicine, and biology.
With high-quality human annotations of the slide text and spoken words, the dataset can be used for multiple audio-visual recognition and understanding tasks.
arXiv Detail & Related papers (2024-03-21T06:43:59Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words (the per-clip averages these figures imply are sketched after this entry).
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
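As a quick sanity check, here is a rough back-of-the-envelope calculation (ours, not from the paper) of the per-clip averages implied by the quoted InternVid figures:

```python
# Averages implied by the quoted InternVid statistics (approximate).
total_hours = 760_000          # "nearly 760K hours"
num_clips = 234_000_000        # "234M video clips"
total_words = 4_100_000_000    # "4.1B words" of descriptions

avg_clip_seconds = total_hours * 3600 / num_clips  # ~11.7 seconds per clip
avg_words_per_clip = total_words / num_clips       # ~17.5 words per description

print(f"~{avg_clip_seconds:.1f} s per clip, ~{avg_words_per_clip:.1f} words per clip")
```

So InternVid skews toward short clips with brief captions, in contrast to the ~145-word captions of Vript above.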
- MultiVENT: Multilingual Videos of Events with Aligned Natural Text [29.266266741468055]
MultiVENT is a dataset of multilingual, event-centric videos grounded in text documents across five target languages.
We analyze the state of online news videos and how they can be leveraged to build robust, factually accurate models.
arXiv Detail & Related papers (2023-07-06T17:29:34Z)
- A Multimodal Sentiment Dataset for Video Recommendation [21.44500591776281]
We propose a multimodal sentiment analysis dataset, named baiDu Video Sentiment dataset (DuVideoSenti).
DuVideoSenti consists of 5,630 videos displayed on Baidu.
Each video is manually annotated with a sentimental style label that describes the user's real feeling about the video.
arXiv Detail & Related papers (2021-09-17T03:10:42Z)
- QuerYD: A video dataset with high-quality text and audio narrations [85.6468286746623]
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip (a minimal sketch of this format follows the entry).
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
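To make the task format concrete, here is a minimal sketch of one VIOLIN-style example, assuming hypothetical field names; see the VIOLIN release for the actual format:

```python
from dataclasses import dataclass

@dataclass
class ViolinExample:
    # Field names are illustrative assumptions, not the official VIOLIN schema.
    clip_id: str      # identifies the video clip (the visual premise)
    subtitles: str    # aligned subtitles, included in the premise
    hypothesis: str   # natural-language statement about the clip
    entailed: bool    # True if entailed by the clip, False if contradicted

# A hypothetical pair: a model sees (clip + subtitles, hypothesis)
# and must predict the boolean entailment label.
example = ViolinExample(
    clip_id="clip_00001",
    subtitles="I can't believe you came all this way.",
    hypothesis="The speaker is surprised to see the visitor.",
    entailed=True,
)
```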
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.