AutoLV: Automatic Lecture Video Generator
- URL: http://arxiv.org/abs/2209.08795v1
- Date: Mon, 19 Sep 2022 07:00:14 GMT
- Title: AutoLV: Automatic Lecture Video Generator
- Authors: Wenbin Wang, Yang Song, Sanjay Jha
- Abstract summary: We propose an end-to-end lecture video generation system.
It can generate realistic and complete lecture videos directly from annotated slides, instructor's reference voice and instructor's reference portrait video.
- Score: 16.73368874008744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an end-to-end lecture video generation system that can generate
realistic and complete lecture videos directly from annotated slides,
instructor's reference voice and instructor's reference portrait video. Our
system is primarily composed of a speech synthesis module with few-shot speaker
adaptation and an adversarial learning-based talking-head generation module. It
is capable of not only reducing instructors' workload but also changing the
language and accent, which can help students follow the lecture more easily
and enable wider dissemination of lecture content. Our experimental results
show that the proposed model outperforms other current approaches in terms of
authenticity, naturalness and accuracy. Here is a video demonstration of how
our system works, and the outcomes of the evaluation and comparison:
https://youtu.be/cY6TYkI0cog.
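The abstract describes a two-stage architecture but ships no code here; the following is only a minimal Python sketch of that data flow (slide notes into a few-shot speaker-adapted TTS stage, mel frames into a talking-head stage). Every name in it (SpeakerAdaptedTTS, TalkingHeadGenerator, generate_lecture) is a hypothetical placeholder for the authors' modules, and the bodies are stubs.

```python
"""Minimal sketch (not the authors' code) of the pipeline in the abstract:
slide notes -> few-shot speaker-adapted TTS -> mel frames -> talking head."""
import numpy as np


class SpeakerAdaptedTTS:
    """Placeholder for the few-shot speaker-adapted speech synthesiser."""

    def adapt(self, reference_voice: np.ndarray) -> None:
        # Real system: a short reference recording conditions/fine-tunes
        # a speaker embedding. Here we just keep a scalar statistic.
        self.speaker_embedding = float(reference_voice.mean())

    def synthesise(self, text: str) -> np.ndarray:
        # Placeholder: one 80-bin mel frame per character of the notes.
        return np.full((len(text), 80), self.speaker_embedding)


class TalkingHeadGenerator:
    """Placeholder for the adversarially trained talking-head module."""

    def __init__(self, reference_portrait: np.ndarray):
        self.portrait = reference_portrait  # one H x W x 3 portrait frame

    def animate(self, mel: np.ndarray) -> np.ndarray:
        # Placeholder: emit one (static) video frame per mel frame.
        return np.repeat(self.portrait[None], len(mel), axis=0)


def generate_lecture(slide_notes, reference_voice, reference_portrait):
    tts = SpeakerAdaptedTTS()
    tts.adapt(reference_voice)            # few-shot adaptation happens once
    head = TalkingHeadGenerator(reference_portrait)
    clips = [head.animate(tts.synthesise(note)) for note in slide_notes]
    return np.concatenate(clips, axis=0)  # one video for the whole deck


video = generate_lecture(
    slide_notes=["Welcome to the course.", "Today: generative models."],
    reference_voice=np.random.randn(16000),      # ~1 s of reference audio
    reference_portrait=np.zeros((256, 256, 3)),  # reference portrait frame
)
print(video.shape)  # (num_frames, 256, 256, 3)
```

The point of the sketch is the interface: the speaker adaptation happens once on the reference recording, after which each slide's notes flow through synthesis and animation independently.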
Related papers
- Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries [0.054204929130712134]
The prototype utilizes machine learning-based techniques to recognise selected didactic and behavioural features of teachers within lecture video recordings.
The system offers flexibility for (future) integration of new/additional machine-learning models and software modules for image and video analysis.
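The summary does not say which models are used; purely as a loose illustration, here is a hypothetical per-frame behaviour classifier in Python (PyTorch). The behaviour labels and feature dimension are invented for the sketch.

```python
import torch
from torch import nn

# Hypothetical labels and dimensions; the summary does not name the
# recognised features or the models behind them.
BEHAVIOURS = ["writing_on_board", "gesturing", "facing_class", "reading_slides"]


class FrameBehaviourClassifier(nn.Module):
    """Toy per-frame classifier over features from any visual backbone."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, len(BEHAVIOURS))
        )

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (num_frames, feat_dim)
        return self.head(frame_features)


features = torch.randn(300, 512)        # e.g. 300 frames of one recording
per_frame = FrameBehaviourClassifier()(features).argmax(dim=-1)
print({b: int((per_frame == i).sum()) for i, b in enumerate(BEHAVIOURS)})
```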
arXiv Detail & Related papers (2024-06-20T12:45:23Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
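As a hedged sketch of that idea, the snippet below trains a toy lip encoder with two losses: an acoustic reconstruction loss plus a CTC loss against transcripts assumed to come from a separate, pretrained lip-to-text network. The architecture and loss weighting are illustrative, not the paper's.

```python
import torch
from torch import nn

# Toy lip encoder with a speech branch (mel regression) and a language
# branch (CTC against noisy transcripts). All shapes are placeholders.


class LipEncoder(nn.Module):
    def __init__(self, vocab: int = 40, mel_bins: int = 80, d: int = 256):
        super().__init__()
        self.rnn = nn.GRU(input_size=96 * 96, hidden_size=d, batch_first=True)
        self.to_mel = nn.Linear(d, mel_bins)     # speech branch
        self.to_text = nn.Linear(d, vocab)       # language branch (CTC)

    def forward(self, lips):                     # lips: (B, T, 96*96)
        h, _ = self.rnn(lips)
        return self.to_mel(h), self.to_text(h).log_softmax(-1)


model = LipEncoder()
lips = torch.randn(2, 50, 96 * 96)               # 2 clips, 50 lip frames
mel_target = torch.randn(2, 50, 80)              # ground-truth acoustics
noisy_text = torch.randint(1, 40, (2, 12))       # from a lip-to-text model
mel_pred, text_logp = model(lips)

recon = nn.functional.l1_loss(mel_pred, mel_target)
ctc = nn.CTCLoss()(text_logp.transpose(0, 1),    # (T, B, vocab) for CTC
                   noisy_text,
                   input_lengths=torch.full((2,), 50),
                   target_lengths=torch.full((2,), 12))
loss = recon + 0.5 * ctc                         # weighting is illustrative
print(float(loss))
```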
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
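A minimal sketch of the frozen-backbone recipe the summary mentions, with placeholder modules: the pretrained vision and language parts are frozen and only a small fusion layer trains, which is where the training-efficiency gain would come from. None of this is the actual Video-Teller architecture.

```python
import torch
from torch import nn

# Placeholder "pretrained" modules; in the real system these would be
# actual vision and large language model backbones.
vision_encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 224 * 224, 768))
language_model = nn.GRU(input_size=768, hidden_size=768, batch_first=True)
for module in (vision_encoder, language_model):  # freeze pretrained parts
    for p in module.parameters():
        p.requires_grad_(False)

fusion = nn.Linear(768, 768)                     # the only trainable piece

frames = torch.randn(8, 3, 224, 224)             # 8 sampled video frames
frame_tokens = fusion(vision_encoder(frames))    # (8, 768) visual tokens
out, _ = language_model(frame_tokens.unsqueeze(0))
print(out.shape)                                 # (1, 8, 768)

trainable = sum(p.numel() for p in fusion.parameters())
print(f"trainable parameters: {trainable}")      # training-efficiency win
```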
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
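The VGCL algorithm itself is not given here; the sketch below only illustrates the general curriculum idea of ramping audio noise up over training while the video stream stays available as a stable source of grounding information. The schedule and mixing are invented for illustration.

```python
import torch

# Illustrative curriculum: train on clean audio first, then gradually
# increase the noise level, so the model learns to rely on video cues.


def curriculum_noise_level(epoch: int, total_epochs: int) -> float:
    """Linearly ramp the audio noise from 0 to 1 over training."""
    return min(1.0, epoch / (0.5 * total_epochs))  # full noise halfway in


def make_training_batch(clean_audio: torch.Tensor, epoch: int,
                        total_epochs: int) -> torch.Tensor:
    level = curriculum_noise_level(epoch, total_epochs)
    noise = torch.randn_like(clean_audio)
    return (1 - level) * clean_audio + level * noise


clean = torch.randn(4, 16000)                    # 4 one-second audio clips
for epoch in (0, 10, 25, 50):
    batch = make_training_batch(clean, epoch, total_epochs=50)
    print(epoch, round(curriculum_noise_level(epoch, 50), 2), batch.shape)
```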
arXiv Detail & Related papers (2022-09-01T07:47:01Z)
- CLUE: Contextualised Unified Explainable Learning of User Engagement in Video Lectures [6.25256391074865]
We propose a new unified model, CLUE, which learns from the features extracted from public online teaching videos.
Our model exploits various multi-modal features to model the complexity of language, contextual information, and the textual emotion of the delivered content.
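As a generic illustration only, assuming separate language, context, and emotion feature vectors per video (the summary does not specify CLUE's dimensions or fusion scheme), a minimal fusion model might look like this:

```python
import torch
from torch import nn

# Placeholder feature dimensions for the three modalities; the real
# model's features and fusion scheme are not described in the summary.


class EngagementModel(nn.Module):
    def __init__(self, lang_d=300, ctx_d=64, emo_d=16):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(lang_d + ctx_d + emo_d, 128),
            nn.ReLU(),
            nn.Linear(128, 1),   # predicted engagement score
        )

    def forward(self, lang, ctx, emo):
        return self.fuse(torch.cat([lang, ctx, emo], dim=-1)).squeeze(-1)


model = EngagementModel()
score = model(torch.randn(5, 300), torch.randn(5, 64), torch.randn(5, 16))
print(score.shape)   # one engagement prediction per lecture video
```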
arXiv Detail & Related papers (2022-01-14T19:51:06Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA)
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
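A rough sketch of that pretext task follows, with a plain TransformerEncoder standing in for the paper's Conformer and torchvision's resnet18 for its visual front-end; the dimensions and the L1 objective are illustrative assumptions.

```python
import torch
from torch import nn
from torchvision.models import resnet18

# Pretext task: regress acoustic features from lip video, no labels needed.
visual_frontend = resnet18(num_classes=256)      # per-frame lip encoder
temporal = nn.TransformerEncoder(                # stand-in for a Conformer
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
to_acoustic = nn.Linear(256, 80)                 # predict 80-dim features

frames = torch.randn(2, 25, 3, 112, 112)         # (batch, time, C, H, W)
b, t = frames.shape[:2]
feats = visual_frontend(frames.flatten(0, 1)).view(b, t, 256)
pred = to_acoustic(temporal(feats))              # (2, 25, 80)

target = torch.randn(2, 25, 80)                  # acoustic features from audio
loss = nn.functional.l1_loss(pred, target)       # self-supervised regression
print(pred.shape, float(loss))
```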
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
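One standard way to learn such a mapping without annotations is a contrastive objective over co-occurring sentence/snippet pairs, as sketched after the next line; the encoders there are placeholders and the formulation is not claimed to be the paper's exact one.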
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
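```python
import torch
from torch import nn

# Contrastive alignment sketch: embed sentences and video snippets into a
# shared space and pull matching (co-occurring) pairs together. Linear
# layers stand in for real text/video encoders.
sentence_enc = nn.Linear(300, 128)     # placeholder text encoder
snippet_enc = nn.Linear(2048, 128)     # placeholder video encoder

sentences = torch.randn(16, 300)       # 16 subtitle sentences
snippets = torch.randn(16, 2048)       # their co-occurring video snippets

s = nn.functional.normalize(sentence_enc(sentences), dim=-1)
v = nn.functional.normalize(snippet_enc(snippets), dim=-1)

logits = s @ v.t() / 0.07              # cosine similarities / temperature
labels = torch.arange(16)              # i-th sentence matches i-th snippet
loss = nn.functional.cross_entropy(logits, labels)
print(float(loss))
```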
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
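As a sketch of the dictionary-learning ingredient only, scikit-learn's DictionaryLearning can factor joint video-text features into sparse codes over learned atoms, which can be read as reusable relation-like components. The data here is synthetic, and this setup is not the paper's actual model.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
video_text_feats = rng.standard_normal((200, 64))   # 200 paired clips

dl = DictionaryLearning(n_components=16, alpha=1.0, max_iter=50,
                        random_state=0)
codes = dl.fit_transform(video_text_feats)          # sparse codes (200, 16)
atoms = dl.components_                              # dictionary (16, 64)

print(codes.shape, atoms.shape)
print("avg non-zeros per code:", (codes != 0).sum(1).mean())
```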
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
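A minimal sketch of a shared audio-visual embedding space trained contrastively, with linear placeholders for AVLnet's audio and visual backbones; the symmetric InfoNCE-style loss is an assumption for illustration, as is the retrieval step at the end.

```python
import torch
from torch import nn

audio_enc = nn.Linear(40 * 100, 256)   # e.g. flattened spectrogram patches
video_enc = nn.Linear(2048, 256)       # e.g. pooled clip features

audio = torch.randn(32, 40 * 100)
video = torch.randn(32, 2048)
a = nn.functional.normalize(audio_enc(audio), dim=-1)
v = nn.functional.normalize(video_enc(video), dim=-1)

sim = a @ v.t() / 0.1                  # cosine similarity / temperature
labels = torch.arange(32)              # clip i pairs with its own audio
loss = (nn.functional.cross_entropy(sim, labels) +      # audio -> video
        nn.functional.cross_entropy(sim.t(), labels))   # video -> audio
print(float(loss))

# After training, retrieval is nearest-neighbour search in the shared space.
print("clip retrieved for audio 0:", int((v @ a[0]).argmax()))
```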
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Vocoder-Based Speech Synthesis from Silent Videos [28.94460283719776]
We present a way to synthesise speech from the silent video of a talker using deep learning.
The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm.
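A sketch of the pipeline shape only: an (untrained) network maps video frames to spectrogram frames, and a vocoder turns those into a waveform. torchaudio's Griffin-Lim stands in here for the paper's vocoder synthesis algorithm, just to keep the sketch self-contained.

```python
import torch
from torch import nn
import torchaudio

n_fft = 400
video_to_spec = nn.Sequential(                   # video frame -> spectrum
    nn.Flatten(1), nn.Linear(64 * 64, 512), nn.ReLU(),
    nn.Linear(512, n_fft // 2 + 1), nn.Softplus(),  # non-negative magnitudes
)

frames = torch.randn(80, 1, 64, 64)              # 80 mouth-region frames
spec = video_to_spec(frames).t().detach()        # (freq, time) spectrogram

vocoder = torchaudio.transforms.GriffinLim(n_fft=n_fft, n_iter=8)
waveform = vocoder(spec)                         # spectrogram -> audio
print(waveform.shape)                            # reconstructed samples
```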
arXiv Detail & Related papers (2020-04-06T10:22:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.