Classification of Important Segments in Educational Videos using
Multimodal Features
- URL: http://arxiv.org/abs/2010.13626v1
- Date: Mon, 26 Oct 2020 14:40:23 GMT
- Title: Classification of Important Segments in Educational Videos using
Multimodal Features
- Authors: Junaid Ahmed Ghauri, Sherzod Hakimov and Ralph Ewerth
- Abstract summary: We propose a multimodal neural architecture that utilizes state-of-the-art audio, visual and textual features.
Our experiments investigate the impact of visual and temporal information, as well as the combination of multimodal features on importance prediction.
- Score: 10.175871202841346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos are a commonly-used type of content in learning during Web search.
Many e-learning platforms provide quality content, but sometimes educational
videos are long and cover many topics. Humans are good at extracting important
sections from videos, but it remains a significant challenge for computers. In
this paper, we address the problem of assigning importance scores to video
segments, that is, how much information they contain with respect to the overall
topic of an educational video. We present an annotation tool and a new dataset
of annotated educational videos collected from popular online learning
platforms. Moreover, we propose a multimodal neural architecture that utilizes
state-of-the-art audio, visual and textual features. Our experiments
investigate the impact of visual and temporal information, as well as the
combination of multimodal features on importance prediction.
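As a rough, non-authoritative illustration of how such a multimodal architecture could fuse per-segment audio, visual, and textual features for importance prediction, the following PyTorch sketch is given; the feature dimensions, the GRU-based temporal layer, and all class and layer names are assumptions made here for illustration, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class SegmentImportanceModel(nn.Module):
    """Scores each video segment's importance from fused multimodal features (illustrative sketch)."""
    def __init__(self, audio_dim=128, visual_dim=2048, text_dim=768, hidden_dim=256):
        super().__init__()
        # Project each modality into a shared hidden space (dimensions are assumed).
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        # Bidirectional GRU models temporal context across consecutive segments.
        self.temporal = nn.GRU(3 * hidden_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        # Regress a scalar importance score per segment.
        self.scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, audio, visual, text):
        # Inputs: (batch, num_segments, modality_dim) pre-extracted feature vectors.
        fused = torch.cat([
            torch.relu(self.audio_proj(audio)),
            torch.relu(self.visual_proj(visual)),
            torch.relu(self.text_proj(text)),
        ], dim=-1)
        context, _ = self.temporal(fused)
        return self.scorer(context).squeeze(-1)  # (batch, num_segments)

# Dummy usage: one video split into 12 segments.
model = SegmentImportanceModel()
scores = model(torch.randn(1, 12, 128),
               torch.randn(1, 12, 2048),
               torch.randn(1, 12, 768))
print(scores.shape)  # torch.Size([1, 12])
```

The bidirectional GRU here stands in for the temporal modeling whose impact the experiments investigate; swapping it for a purely per-segment feed-forward layer would isolate the contribution of temporal context.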
Related papers
- Deep video representation learning: a survey [4.9589745881431435]
We review recent sequential feature learning methods for visual data and compare their pros and cons for general video analysis.
Building effective features for videos is a fundamental problem in computer vision tasks involving video analysis and understanding.
arXiv Detail & Related papers (2024-05-10T16:20:11Z)
- FastPerson: Enhancing Video Learning through Effective Video Summarization that Preserves Linguistic and Visual Contexts [23.6178079869457]
We propose FastPerson, a video summarization approach that considers both the visual and auditory information in lecture videos.
FastPerson creates summary videos by utilizing audio transcriptions along with on-screen images and text.
It reduces viewing time by 53% at the same level of comprehension as traditional video playback methods.
arXiv Detail & Related papers (2024-03-26T14:16:56Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Just a Glimpse: Rethinking Temporal Information for Video Continual Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual/single frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both image and video domains.
In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- VLEngagement: A Dataset of Scientific Video Lectures for Evaluating Population-based Engagement [23.078055803229912]
Video lectures have become one of the primary modalities to impart knowledge to the masses in the current digital age.
There is still an important need for data and research aimed at understanding learner engagement with scientific video lectures.
This paper introduces VLEngagement, a novel dataset that consists of content-based and video-specific features extracted from publicly available scientific video lectures.
arXiv Detail & Related papers (2020-11-02T14:20:19Z)