Related papers: SliTraNet: Automatic Detection of Slide Transitions in Lecture Videos using Convolutional Neural Networks

Related papers

Paper2Video: Automatic Video Generation from Scientific Papers [62.634562246594555]
Paper2Video is the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We propose PaperTalker, the first multi-agent framework for academic presentation video generation.
arXiv Detail & Related papers (2025-10-06T17:58:02Z)
Generating Narrated Lecture Videos from Slides with Synchronized Highlights [55.2480439325792]
We introduce an end-to-end system designed to automate the process of turning static slides into video lectures. This system synthesizes a video lecture featuring AI-generated narration precisely synchronized with dynamic visual highlights. We demonstrate the system's effectiveness through a technical evaluation using a manually annotated slide dataset with 1000 samples.
arXiv Detail & Related papers (2025-05-05T18:51:53Z)
Learning from Streaming Video with Orthogonal Gradients [62.51504086522027]
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks.
arXiv Detail & Related papers (2025-04-02T17:59:57Z)
PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos [22.39414772037232]
PreMind is a novel multi-agent multimodal framework for understanding/indexing lecture videos. It generates multimodal indexes through three key steps: extracting slide visual content, transcribing speech narratives, and consolidating these visual and speech contents into an integrated understanding. Three innovative mechanisms are also proposed to improve performance: leveraging prior lecture knowledge to refine visual understanding, detecting/correcting speech transcription errors using a VLM, and utilizing a critic agent for dynamic iterative self-reflection in vision analysis.
arXiv Detail & Related papers (2025-02-28T20:17:48Z)
Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment [53.12952107996463]
This work proposes a novel training framework for learning to localize temporal boundaries of procedure steps in training videos. Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps from narrations. To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy.
arXiv Detail & Related papers (2024-09-22T18:40:55Z)
VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information. At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings. At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
Semi-supervised 3D Video Information Retrieval with Deep Neural Network and Bi-directional Dynamic-time Warping Algorithm [14.39527406033429]
The proposed algorithm is designed to handle large video datasets and retrieve the most related videos to a given inquiry video clip. We split both the candidate and the inquiry videos into a sequence of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network. We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method.
arXiv Detail & Related papers (2023-09-03T03:10:18Z)
AutoTransition: Learning to Recommend Video Transition Effects [20.384463765702417]
We present the premier work on automatic video transition recommendation (VTR): given a sequence of raw video shots and companion audio, recommend video transitions for each pair of neighboring shots. We propose a novel multi-modal matching framework which consists of two parts.
arXiv Detail & Related papers (2022-07-27T12:00:42Z)
Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-text pre-training aims at learning transferable representations from large-scale video-text pairs. We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input. We study how we can better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture. The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
Self-Supervised Learning via multi-Transformation Classification for Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on multi-transformation classification to efficiently classify human actions. The representation of the video is learned in a self-supervised manner by classifying seven different transformations. We conducted experiments on the UCF101 and HMDB51 datasets with C3D and 3D ResNet-18 as backbone networks.
arXiv Detail & Related papers (2021-02-20T16:11:26Z)
A Comprehensive Study of Deep Video Action Recognition [35.7068977497202]
Video action recognition is one of the representative tasks for video understanding. We provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition.
arXiv Detail & Related papers (2020-12-11T18:54:08Z)
Video Representation Learning by Recognizing Temporal Transformations [37.59322456034611]
We introduce a novel self-supervised learning approach to learn representations of videos responsive to changes in the motion dynamics. We promote an accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions. Our experiments show that networks trained with the proposed method yield representations with improved transfer performance for action recognition.
arXiv Detail & Related papers (2020-07-21T11:43:01Z)
Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation. We propose an elegant curriculum-based scheme that smooths the feature embedding of a CNN using anti-aliasing or low-pass filters. As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
arXiv Detail & Related papers (2020-03-03T07:27:44Z)
Dynamic Inference: A New Approach Toward Efficient Video Action Recognition [69.9658249941149]
Action recognition in videos has achieved great success recently, but it remains a challenging task due to the massive computational cost. We propose a general dynamic inference idea to improve inference efficiency by leveraging the variation in the distinguishability of different videos.
arXiv Detail & Related papers (2020-02-09T11:09:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.