AI based Presentation Creator With Customized Audio Content Delivery
- URL: http://arxiv.org/abs/2106.14213v1
- Date: Sun, 27 Jun 2021 12:17:11 GMT
- Title: AI based Presentation Creator With Customized Audio Content Delivery
- Authors: Muvazima Mansoor, Srikanth Chandar, Ramamoorthy Srinath
- Abstract summary: This paper aims to use Machine Learning (ML) algorithms and Natural Language Processing (NLP) modules to automate the process of creating a slides-based presentation from a document.
We then use state-of-the-art voice cloning models to deliver the content in the desired author's voice.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose an architecture to address a problem that has grown markedly in recent times with the increased demand for virtual content delivery brought on by the COVID-19 pandemic. Educational institutions, workplaces, research centers, and similar organizations are trying to bridge the communication gap of these socially distanced times through online content delivery. The common practice is to create presentations and then deliver them over virtual meeting platforms. This paper aims to reduce, and largely eliminate, the time spent on creating and delivering such presentations by using Machine Learning (ML) algorithms and Natural Language Processing (NLP) modules to automate the creation of a slides-based presentation from a document, and then using state-of-the-art voice cloning models to deliver the content in the desired author's voice. We consider a structured document, such as a research paper, as the content to be presented. The paper is first summarized using BERT summarization techniques and condensed into bullet points that populate the slides. A Tacotron-inspired architecture with an encoder, a synthesizer, and a Generative Adversarial Network (GAN) based vocoder is then used to convey the contents of the slides in the author's voice (or any customized voice). With almost all learning now shifted online and professionals working from home, teachers and professionals rely heavily on presentations to impart information. We aim to reduce the considerable time taken to create a presentation by automating this process and delivering the result in a customized voice, using a content delivery mechanism that can clone any voice from a short audio clip.
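The pipeline described in the abstract (document → summarization → slide bullets → cloned-voice narration) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Hugging Face `transformers` summarization pipeline (which loads a generic abstractive summarizer rather than the paper's BERT-based summarizer) and `python-pptx` for slide generation, while the voice-cloning stage is reduced to hypothetical encoder/synthesizer/vocoder interfaces mirroring the abstract's description.

```python
# Minimal sketch of the pipeline: summarize sections, build slides, narrate
# them in a cloned voice. Library choices and all class/function names are
# illustrative assumptions, not the authors' code.
from transformers import pipeline   # pip install transformers
from pptx import Presentation       # pip install python-pptx


def summarize_sections(sections, max_bullets=4):
    """Condense each document section into a few short bullet points."""
    # The paper uses BERT-based summarization; a generic summarization
    # pipeline stands in for it here.
    summarizer = pipeline("summarization")
    bullets = {}
    for title, text in sections.items():
        summary = summarizer(text, max_length=60, min_length=15)[0]["summary_text"]
        bullets[title] = [s.strip() for s in summary.split(". ") if s.strip()][:max_bullets]
    return bullets


def build_slides(bullets, out_path="presentation.pptx"):
    """Write one slide per section, with its bullets in the body placeholder."""
    prs = Presentation()
    layout = prs.slide_layouts[1]   # built-in "Title and Content" layout
    for title, points in bullets.items():
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = title
        body = slide.placeholders[1].text_frame
        body.text = points[0] if points else ""
        for point in points[1:]:
            body.add_paragraph().text = point
    prs.save(out_path)
    return out_path


class SpeakerEncoder:
    """Maps a short reference clip to a fixed-size speaker embedding."""
    def embed(self, reference_wav):
        ...  # e.g. a GE2E-style d-vector network


class Synthesizer:
    """Tacotron-style text-to-mel-spectrogram model conditioned on the speaker."""
    def text_to_mel(self, text, speaker_embedding):
        ...


class GANVocoder:
    """GAN-based vocoder that converts mel-spectrograms into waveforms."""
    def mel_to_wav(self, mel):
        ...


def narrate_slides(bullets, reference_wav, encoder, synthesizer, vocoder):
    """Render each slide's bullets as audio in the cloned voice."""
    speaker = encoder.embed(reference_wav)
    return {
        title: vocoder.mel_to_wav(synthesizer.text_to_mel(" ".join(points), speaker))
        for title, points in bullets.items()
    }
```

In practice, the summarization, slide generation, and narration stages run in sequence, with the narration step requiring only a short reference clip of the target speaker.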
Related papers
- IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities [55.11130688075417]
We introduce IntrinsicVoice, an LLM designed with intrinsic real-time voice interaction capabilities.
Our novel architecture, GroupFormer, can reduce speech sequences to lengths comparable to text sequences.
We construct a multi-turn speech-to-speech dialogue dataset named method-500k which includes nearly 500k turns of speech-to-speech dialogues.
arXiv Detail & Related papers (2024-10-09T05:04:31Z)
- Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries [0.054204929130712134]
The prototype utilizes machine learning-based techniques to recognise teachers' selected didactic and behavioural features within lecture video recordings.
The system offers flexibility for (future) integration of new/additional machine-learning models and software modules for image and video analysis.
arXiv Detail & Related papers (2024-06-20T12:45:23Z)
- Towards Accurate Lip-to-Speech Synthesis in-the-Wild [31.289366690147556]
We introduce a novel approach to address the task of synthesizing speech from silent videos of any in-the-wild speaker solely based on lip movements.
The traditional approach of directly generating speech from lip videos faces the challenge of not being able to learn a robust language model from speech alone.
We propose incorporating noisy text supervision using a state-of-the-art lip-to-text network that instills language information into our model.
arXiv Detail & Related papers (2024-03-02T04:07:24Z)
- Cross-Utterance Conditioned VAE for Speech Generation [27.5887600344053]
We present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation.
We propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing.
arXiv Detail & Related papers (2023-09-08T06:48:41Z)
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- Multimodal Lecture Presentations Dataset: Understanding Multimodality in Educational Slides [57.86931911522967]
We test the capabilities of machine learning models in multimodal understanding of educational content.
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects.
We introduce PolyViLT, a multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches.
arXiv Detail & Related papers (2022-08-17T05:30:18Z)
- CLUE: Contextualised Unified Explainable Learning of User Engagement in Video Lectures [6.25256391074865]
We propose a new unified model, CLUE, which learns from the features extracted from public online teaching videos.
Our model exploits various multi-modal features to model the complexity of language, contextual information, and the textual emotion of the delivered content.
arXiv Detail & Related papers (2022-01-14T19:51:06Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.