RealityTalk: Real-Time Speech-Driven Augmented Presentation for AR Live Storytelling
- URL: http://arxiv.org/abs/2208.06350v1
- Date: Fri, 12 Aug 2022 16:12:00 GMT
- Title: RealityTalk: Real-Time Speech-Driven Augmented Presentation for AR Live Storytelling
- Authors: Jian Liao, Adnan Karim, Shivesh Jadon, Rubaiat Habib Kazi, Ryo Suzuki
- Abstract summary: We present RealityTalk, a system that augments real-time live presentations with speech-driven interactive virtual elements.
Based on our analysis of 177 existing video-edited augmented presentations, we propose a novel set of interaction techniques.
We evaluate our tool from a presenter's perspective to demonstrate the effectiveness of our system.
- Score: 7.330145218077073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present RealityTalk, a system that augments real-time live presentations
with speech-driven interactive virtual elements. Augmented presentations
leverage embedded visuals and animation for engaging and expressive
storytelling. However, existing tools for live presentations often lack
interactivity and improvisation, while creating such effects in video editing
tools requires significant time and expertise. RealityTalk enables users to
create live augmented presentations with real-time speech-driven interactions.
The user can interactively prompt, move, and manipulate graphical elements
through real-time speech and supporting modalities. Based on our analysis of
177 existing video-edited augmented presentations, we propose a novel set of
interaction techniques and incorporate them into RealityTalk. We evaluate
our tool from a presenter's perspective to demonstrate the effectiveness of our
system.
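
To make the speech-driven interaction concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of how spoken keywords from a live transcript might trigger AR graphical elements. The keyword map, asset filenames, and the simulated recognizer output are all assumptions for illustration; a real system would receive words from a streaming speech recognizer and render the assets in AR.

```python
# Hypothetical sketch: mapping live speech transcripts to AR visual triggers,
# in the spirit of RealityTalk's speech-driven prompting of graphical elements.
from dataclasses import dataclass


@dataclass
class VisualTrigger:
    keyword: str  # word the presenter speaks
    asset: str    # graphic/animation to display (hypothetical asset names)


# Assumed example mapping; RealityTalk lets presenters author such links.
TRIGGERS = [
    VisualTrigger("rocket", "rocket_launch.gif"),
    VisualTrigger("growth", "growth_chart.png"),
]


def process_transcript(words, triggers=TRIGGERS):
    """Scan a stream of recognized words and yield assets to display."""
    for word in words:
        for t in triggers:
            if word.lower().strip(".,!?") == t.keyword:
                yield t.asset


if __name__ == "__main__":
    # Simulated output of a real-time speech recognizer.
    live_words = "Our growth last year was like a rocket.".split()
    for asset in process_transcript(live_words):
        print("show:", asset)  # a real system would render this in AR
```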
Related papers
- Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data.
arXiv Detail & Related papers (2024-09-23T20:19:24Z)
- Real Time Emotion Analysis Using Deep Learning for Education, Entertainment, and Beyond [0.0]
The project consists of two components.
We will employ sophisticated image processing techniques and neural networks to construct a deep learning model capable of precisely categorising facial expressions.
The app will use this model to promptly analyse facial expressions and display corresponding emojis.
arXiv Detail & Related papers (2024-07-05T14:48:19Z)
- RITA: A Real-time Interactive Talking Avatars Framework [6.060251768347276]
RITA presents a high-quality real-time interactive framework built upon generative models.
Our framework enables the transformation of user-uploaded photos into digital avatars that can engage in real-time dialogue interactions.
arXiv Detail & Related papers (2024-06-18T22:53:15Z)
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- What You Say Is What You Show: Visual Narration Detection in Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z)
- Tell Your Story: Task-Oriented Dialogs for Interactive Content Creation [11.538915414185022]
We propose task-oriented dialogs for montage creation as a novel interactive tool to seamlessly search, compile, and edit montages from a media collection.
We collect a new dataset C3 (Conversational Content Creation), comprising 10k dialogs conditioned on media montages simulated from a large media collection.
Our analysis and benchmarking of state-of-the-art language models showcase the multimodal challenges present in the dataset.
arXiv Detail & Related papers (2022-11-08T01:23:59Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)