Unsupervised Audio-Visual Lecture Segmentation
- URL: http://arxiv.org/abs/2210.16644v1
- Date: Sat, 29 Oct 2022 16:26:34 GMT
- Title: Unsupervised Audio-Visual Lecture Segmentation
- Authors: Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi
- Abstract summary: We introduce AVLectures, a dataset consisting of 86 courses with over 2,350 lectures covering various STEM subjects.
Our second contribution is introducing video lecture segmentation that splits lectures into bite-sized topics that show promise in improving learner engagement.
We fine-tune clip representations on a self-supervised pretext task of matching narration with temporally aligned visual content, and use them to generate segments with TW-FINCH, a temporally consistent 1-nearest neighbor algorithm.
- Score: 31.29084124332193
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Over the last decade, online lecture videos have become increasingly popular
and have experienced a meteoric rise during the pandemic. However,
video-language research has primarily focused on instructional videos or
movies, and tools to help students navigate the growing online lectures are
lacking. Our first contribution is to facilitate research in the educational
domain, by introducing AVLectures, a large-scale dataset consisting of 86
courses with over 2,350 lectures covering various STEM subjects. Each course
contains video lectures, transcripts, OCR outputs for lecture frames, and
optionally lecture notes, slides, assignments, and related educational content
that can inspire a variety of tasks. Our second contribution is introducing
video lecture segmentation that splits lectures into bite-sized topics that
show promise in improving learner engagement. We formulate lecture segmentation
as an unsupervised task that leverages visual, textual, and OCR cues from the
lecture, while clip representations are fine-tuned on a pretext self-supervised
task of matching the narration with the temporally aligned visual content. We
use these representations to generate segments using a temporally consistent
1-nearest neighbor algorithm, TW-FINCH. We evaluate our method on 15 courses
and compare it against various visual and textual baselines, outperforming all
of them. Our comprehensive ablation studies also identify the key factors
driving the success of our approach.
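To make the segmentation step concrete, below is a minimal sketch of temporally weighted first-neighbor clustering in the spirit of TW-FINCH, applied to per-clip embeddings. It is not the authors' implementation: the cosine distance, the multiplicative temporal weighting, the greedy stopping rule, and the toy embeddings are simplifying assumptions made for illustration; in the paper, clip features would come from the narration-visual matching pretext model and combine visual, transcript, and OCR cues.

```python
# Sketch of TW-FINCH-style segmentation: repeatedly link each cluster to its
# temporally weighted first nearest neighbour and merge connected components.
# Assumptions (not from the paper): cosine distance, multiplicative temporal
# weighting, and a simple stop-before-overshoot rule instead of the refinement
# step that yields an exact segment count.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def tw_finch_segments(features: np.ndarray, num_segments: int) -> np.ndarray:
    """Group temporally ordered clip embeddings into lecture segments.

    features: (T, D) array, one embedding per fixed-length clip.
    Returns one integer segment label per clip.
    """
    T = features.shape[0]
    labels = np.arange(T)                      # start: one cluster per clip
    while labels.max() + 1 > num_segments:
        k = labels.max() + 1
        # Mean feature and mean temporal position of each current cluster.
        means = np.stack([features[labels == c].mean(axis=0) for c in range(k)])
        times = np.array([np.flatnonzero(labels == c).mean() for c in range(k)])
        Xc = means / (np.linalg.norm(means, axis=1, keepdims=True) + 1e-8)
        sem = 1.0 - Xc @ Xc.T                  # semantic distance: 1 - cosine
        tw = np.abs(times[:, None] - times[None, :]) / T
        dist = sem * tw                        # assumed multiplicative weighting
        np.fill_diagonal(dist, np.inf)         # a cluster is not its own neighbour
        nn = dist.argmin(axis=1)               # first (weighted) nearest neighbour
        graph = csr_matrix((np.ones(k), (np.arange(k), nn)), shape=(k, k))
        n_new, merge = connected_components(graph + graph.T, directed=False)
        if n_new < num_segments:
            break                              # next merge would overshoot the target
        labels = merge[labels]                 # apply the coarser partition
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "lecture": three topics of 20 clips each, each topic scattered around
    # its own random direction in feature space (stand-ins for learned features).
    topic_dirs = [rng.normal(size=64) for _ in range(3)]
    clips = np.concatenate([mu + 0.1 * rng.normal(size=(20, 64)) for mu in topic_dirs])
    print(tw_finch_segments(clips, num_segments=3))
    # Expected: temporally contiguous label runs that respect the topic boundaries;
    # the run count can exceed 3 because the refinement step is omitted here.
```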
Related papers
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]
Cross-modal representation learning projects both videos and sentences into a common space for measuring semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning-based approaches have been applied to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Incorporating Domain Knowledge To Improve Topic Segmentation Of Long MOOC Lecture Videos [4.189643331553923]
We propose an algorithm for automatically detecting different coherent topics present inside a long lecture video.
We use a language model on the speech-to-text transcription to capture the implicit meaning of the whole video.
By also leveraging domain knowledge, we capture the way the instructor binds and connects different concepts while teaching.
arXiv Detail & Related papers (2020-12-08T13:37:40Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Video Captioning with Guidance of Multimodal Latent Topics [123.5255241103578]
We propose a unified captioning framework, M&M TGM, which mines multimodal topics from data in an unsupervised fashion.
Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent.
The results from extensive experiments conducted on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model.
arXiv Detail & Related papers (2017-08-31T11:18:28Z)