Related papers: AI-based System for Transforming text and sound to Educational Videos

AI-based System for Transforming text and sound to Educational Videos

URL: http://arxiv.org/abs/2601.17022v1
Date: Fri, 16 Jan 2026 14:00:00 GMT
Title: AI-based System for Transforming text and sound to Educational Videos
Authors: M. E. ElAlami, S. M. Khater, M. El. R. Rehan,
Abstract summary: We introduce a novel method to generate educational videos from conditional inputs such as text or speech.<n>The Generative Adversarial Network (GAN) develops frame-for-frame frameworks and are able to create full educational videos.<n>The proposed system is compared with other systems such as TGAN, MoCoGAN, and TGANS-C, achieving a Fréchet Inception Distance (FID) score of 28.75%.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Technological developments have produced methods that can generate educational videos from input text or sound. Recently, the use of deep learning techniques for image and video generation has been widely explored, particularly in education. However, generating video content from conditional inputs such as text or speech remains a challenging area. In this paper, we introduce a novel method to the educational structure, Generative Adversarial Network (GAN), which develop frame-for-frame frameworks and are able to create full educational videos. The proposed system is structured into three main phases In the first phase, the input (either text or speech) is transcribed using speech recognition. In the second phase, key terms are extracted and relevant images are generated using advanced models such as CLIP and diffusion models to enhance visual quality and semantic alignment. In the final phase, the generated images are synthesized into a video format, integrated with either pre-recorded or synthesized sound, resulting in a fully interactive educational video. The proposed system is compared with other systems such as TGAN, MoCoGAN, and TGANS-C, achieving a Fréchet Inception Distance (FID) score of 28.75%, which indicates improved visual quality and better over existing methods.

Related papers

EasyGenNet: An Efficient Framework for Audio-Driven Gesture Video Generation Based on Diffusion Model [22.286624353800377]
We introduce a new audio-to-video pipeline to synthesize co-speech videos, using 2D human skeleton as the intermediate motion representation.<n>Our experiments show that our method outperforms existing GAN-based and diffusion-based methods.
arXiv Detail & Related papers (2025-04-11T08:19:18Z)
TA-V2A: Textually Assisted Video-to-Audio Generation [9.957113952852051]
Video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation.<n>We present TA-V2A, a method that integrates language, audio, and video features to improve semantic representation in latent space.
arXiv Detail & Related papers (2025-03-12T06:43:24Z)
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks. Our core design is a progressive training approach that unifies the masked video modeling, cross contrastive learning, and prediction token, scaling up to 6B video size.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model. We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z)
Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework. First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes. Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
Align, Adapt and Inject: Sound-guided Unified Image Generation [50.34667929051005]
We propose a unified framework 'Align, Adapt, and Inject' (AAI) for sound-guided image generation, editing, and stylization. Our method adapts input sound into a sound token, like an ordinary word, which can plug and play with existing Text-to-Image (T2I) models. Our proposed AAI outperforms other text and sound-guided state-of-the-art methods.
arXiv Detail & Related papers (2023-06-20T12:50:49Z)
CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge. We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study. In this paper, we tackle the text to video generation problem, which is a conditional form of video generation. We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
Text2Video: Text-driven Talking-head Video Synthesis with Phonetic Dictionary [10.590649169151055]
We present a novel approach to synthesize video from the text. The method builds a phoneme-pose dictionary and trains a generative adversarial network (GAN) to generate video. Compared to audio-driven video generation algorithms, our approach has a number of advantages.
arXiv Detail & Related papers (2021-04-29T19:54:41Z)
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator [34.7504057664375]
We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video. Step-by-step learning process helps stabilize the training and enables the creation of high-resolution video based on conditional text descriptions.
arXiv Detail & Related papers (2020-09-04T06:33:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.