RealityTalk: Real-Time Speech-Driven Augmented Presentation for AR Live
Storytelling
- URL: http://arxiv.org/abs/2208.06350v1
- Date: Fri, 12 Aug 2022 16:12:00 GMT
- Title: RealityTalk: Real-Time Speech-Driven Augmented Presentation for AR Live
Storytelling
- Authors: Jian Liao, Adnan Karim, Shivesh Jadon, Rubaiat Habib Kazi, Ryo Suzuki
- Abstract summary: We present RealityTalk, a system that augments real-time live presentations with speech-driven interactive virtual elements.
Based on our analysis of 177 existing video-edited augmented presentations, we propose a novel set of interaction techniques.
We evaluate our tool from a presenter's perspective to demonstrate the effectiveness of our system.
- Score: 7.330145218077073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present RealityTalk, a system that augments real-time live presentations
with speech-driven interactive virtual elements. Augmented presentations
leverage embedded visuals and animation for engaging and expressive
storytelling. However, existing tools for live presentations often lack
interactivity and improvisation, while creating such effects in video editing
tools require significant time and expertise. RealityTalk enables users to
create live augmented presentations with real-time speech-driven interactions.
The user can interactively prompt, move, and manipulate graphical elements
through real-time speech and supporting modalities. Based on our analysis of
177 existing video-edited augmented presentations, we propose a novel set of
interaction techniques and then incorporated them into RealityTalk. We evaluate
our tool from a presenter's perspective to demonstrate the effectiveness of our
system.
Related papers
- Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation [71.38488610271247]
Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation.<n>Current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement.<n>We propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing.
arXiv Detail & Related papers (2026-01-02T11:58:48Z) - ALIVE: An Avatar-Lecture Interactive Video Engine with Content-Aware Retrieval for Real-Time Interaction [5.691710068675227]
ALIVE is an Avatar-Lecture Interactive Video Engine that transforms passive lecture viewing into a dynamic, real-time learning experience.<n>ALIVE employs lightweight embedding models, FAISS-based retrieval, and segmented avatar synthesis with progressive preloading.<n>We demonstrate the system on a complete medical imaging course, evaluate its retrieval accuracy, latency characteristics, and user experience, and show that ALIVE provides accurate, content-aware, and engaging real-time support.
arXiv Detail & Related papers (2025-12-24T00:33:59Z) - Audio Driven Real-Time Facial Animation for Social Telepresence [65.66220599734338]
We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency.<n>Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time.<n>We capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance.
arXiv Detail & Related papers (2025-10-01T17:57:05Z) - EmoCAST: Emotional Talking Portrait via Emotive Text Description [56.42674612728354]
EmoCAST is a diffusion-based framework for precise text-driven emotional synthesis.<n>In appearance modeling, emotional prompts are integrated through a text-guided decoupled emotive module.<n>EmoCAST achieves state-of-the-art performance in generating realistic, emotionally expressive, and audio-synchronized talking-head videos.
arXiv Detail & Related papers (2025-08-28T10:02:06Z) - VisualSpeaker: Visually-Guided 3D Avatar Lip Synthesis [70.76837748695841]
We propose VisualSpeaker, a novel method that bridges the gap using photorealistic differentiable rendering, supervised by visual speech recognition, for improved 3D facial animation.<n>Our contribution is a perceptual lip-reading loss, derived by passing 3D Gaussian Splatting avatar renders through a pre-trained Visual Automatic Speech Recognition model during training.<n> Evaluation on the MEAD dataset demonstrates that VisualSpeaker improves both the standard Lip Vertex Error metric by 56.1% and the perceptual quality of the generated animations, while retaining the controllability of mesh-driven animation.
arXiv Detail & Related papers (2025-07-08T15:04:17Z) - Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage.<n>This dataset enables the development of AI technologies that understand dyadic embodied dynamics.<n>We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z) - MoCha: Towards Movie-Grade Talking Character Synthesis [62.007000023747445]
We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text.
Unlike talking head, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region.
We propose MoCha, the first of its kind to generate talking characters.
arXiv Detail & Related papers (2025-03-30T04:22:09Z) - PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation [48.94486508604052]
We introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk.<n>Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet.<n>Key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms.
arXiv Detail & Related papers (2024-12-10T18:51:31Z) - INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations [11.101103116878438]
We propose INFP, a novel audio-driven head generation framework for dyadic interaction.
INFP comprises a Motion-Based Head Imitation stage and an Audio-Guided Motion Generation stage.
To facilitate this line of research, we introduce DyConv, a large scale dataset of rich dyadic conversations collected from the Internet.
arXiv Detail & Related papers (2024-12-05T10:20:34Z) - Speech2rtMRI: Speech-Guided Diffusion Model for Real-time MRI Video of the Vocal Tract during Speech [29.510756530126837]
We introduce a data-driven method to visually represent articulator motion in MRI videos of the human vocal tract during speech.
We leverage large pre-trained speech models, which are embedded with prior knowledge, to generalize the visual domain to unseen data.
arXiv Detail & Related papers (2024-09-23T20:19:24Z) - Real Time Emotion Analysis Using Deep Learning for Education, Entertainment, and Beyond [0.0]
The project consists of two components.
We will employ sophisticated image processing techniques and neural networks to construct a deep learning model capable of precisely categorising facial expressions.
The app will utilise a sophisticated model to promptly analyse facial expressions and promptly exhibit corresponding emojis.
arXiv Detail & Related papers (2024-07-05T14:48:19Z) - RITA: A Real-time Interactive Talking Avatars Framework [6.060251768347276]
RITA presents a high-quality real-time interactive framework built upon generative models.
Our framework enables the transformation of user-uploaded photos into digital avatars that can engage in real-time dialogue interactions.
arXiv Detail & Related papers (2024-06-18T22:53:15Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In
Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate a lack of story understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science on persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - What You Say Is What You Show: Visual Narration Detection in
Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z) - Tell Your Story: Task-Oriented Dialogs for Interactive Content Creation [11.538915414185022]
We propose task-oriented dialogs for montage creation as a novel interactive tool to seamlessly search, compile, and edit montages from a media collection.
We collect a new dataset C3 (Conversational Content Creation), comprising 10k dialogs conditioned on media montages simulated from a large media collection.
Our analysis and benchmarking of state-of-the-art language models showcase the multimodal challenges present in the dataset.
arXiv Detail & Related papers (2022-11-08T01:23:59Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and the human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z) - Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state of the art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.