M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset
- URL: http://arxiv.org/abs/2403.14168v3
- Date: Tue, 4 Jun 2024 04:05:09 GMT
- Title: M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset
- Authors: Zhe Chen, Heyang Liu, Wenyi Yu, Guangzhi Sun, Hongcheng Liu, Ji Wu, Chao Zhang, Yu Wang, Yanfeng Wang
- Abstract summary: We propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV).
M$^3$AV has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics.
With high-quality human annotations of the slide text and spoken words, the dataset can be used for multiple audio-visual recognition and understanding tasks.
- Score: 26.339836754484082
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the texts and pictures in the slides and possibly even the papers. Although multiple academic video datasets have been constructed and released, few of them support both multimodal content recognition and understanding tasks, which is partially due to the lack of high-quality human annotations. In this paper, we propose a novel multimodal, multigenre, and multipurpose audio-visual academic lecture dataset (M$^3$AV), which has almost 367 hours of videos from five sources covering computer science, mathematics, and medical and biology topics. With high-quality human annotations of the slide text and spoken words, in particular high-valued named entities, the dataset can be used for multiple audio-visual recognition and understanding tasks. Evaluations performed on contextual speech recognition, speech synthesis, and slide and script generation tasks demonstrate that the diversity of M$^3$AV makes it a challenging dataset.
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset [53.46019570679092]
We propose a Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multi-modal understanding and generation.
VALOR jointly models relationships of vision, audio and language in an end-to-end manner.
It achieves new state-of-the-art performance on a series of public cross-modality benchmarks.
arXiv Detail & Related papers (2023-04-17T15:08:15Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos [72.69052180249598]
We present 3MASSIV, a multilingual, multimodal and multi-aspect, expertly-annotated dataset of diverse short videos extracted from the short-video social media platform Moj.
3MASSIV comprises 50k short videos (20 seconds average duration) and 100K unlabeled videos in 11 different languages.
We show how the social media content in 3MASSIV is dynamic and temporal in nature, which can be used for semantic understanding tasks and cross-lingual analysis.
arXiv Detail & Related papers (2022-03-28T02:47:01Z)
- Classification of Important Segments in Educational Videos using Multimodal Features [10.175871202841346]
We propose a multimodal neural architecture that utilizes state-of-the-art audio, visual and textual features.
Our experiments investigate the impact of visual and temporal information, as well as the combination of multimodal features on importance prediction.
arXiv Detail & Related papers (2020-10-26T14:40:23Z)