ViDA-MAN: Visual Dialog with Digital Humans
- URL: http://arxiv.org/abs/2110.13384v1
- Date: Tue, 26 Oct 2021 03:23:51 GMT
- Title: ViDA-MAN: Visual Dialog with Digital Humans
- Authors: Tong Shen, Jiawei Zuo, Fan Shi, Jin Zhang, Liqin Jiang, Meng Chen,
Zhengchen Zhang, Wei Zhang, Xiaodong He, Tao Mei
- Abstract summary: Given a speech request, ViDA-MAN is able to respond with high-quality video in sub-second latency.
Backed by a large knowledge base, ViDA-MAN can chat with users on a number of topics including chit-chat, weather, device control, news recommendations, and hotel booking, as well as answering questions over structured knowledge.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction,
which offers real-time audio-visual responses to instant speech inquiries.
Compared to traditional text- or voice-based systems, ViDA-MAN offers human-like
interactions (e.g., vivid voice, natural facial expressions, and body gestures).
Given a speech request, the demonstration is able to respond with high-quality
video in sub-second latency. To deliver an immersive user experience, ViDA-MAN
seamlessly integrates multi-modal techniques including Automatic Speech
Recognition (ASR), multi-turn dialog, Text-To-Speech (TTS), and talking-head
video generation. Backed by a large knowledge base, ViDA-MAN is able to chat with
users on a number of topics including chit-chat, weather, device control, news
recommendations, and hotel booking, as well as answering questions over structured
knowledge.
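The abstract describes a sequential ASR, dialog, TTS, and talking-head pipeline with a sub-second latency budget. Below is a minimal Python sketch of how such a turn loop could be wired together. The component interfaces (transcribe, respond, synthesize, render) and the AgentResponse container are hypothetical placeholders for illustration; the paper does not publish an API, and a production system would likely stream stages concurrently rather than run them strictly in sequence.

    # Minimal sketch of one turn of an ASR -> dialog -> TTS -> talking-head loop.
    # All component interfaces here are hypothetical, not ViDA-MAN's actual API.
    import time
    from dataclasses import dataclass

    @dataclass
    class AgentResponse:
        text: str
        audio: bytes
        video: bytes
        latency_ms: float

    def handle_speech_request(speech_audio: bytes,
                              asr, dialog, tts, renderer) -> AgentResponse:
        """Run one multi-modal turn with injected (duck-typed) components."""
        start = time.perf_counter()
        query = asr.transcribe(speech_audio)   # speech -> text
        reply = dialog.respond(query)          # multi-turn dialog manager
        audio = tts.synthesize(reply)          # text -> speech
        video = renderer.render(audio)         # speech -> talking-head video
        latency_ms = (time.perf_counter() - start) * 1000.0
        return AgentResponse(reply, audio, video, latency_ms)

Measuring latency around the whole turn, as above, is what a sub-second end-to-end target would be checked against; each stage would also need its own budget in practice.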
Related papers
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- Neural Text to Articulate Talk: Deep Text to Audiovisual Speech Synthesis achieving both Auditory and Photo-realism [26.180371869137257]
The state of the art in talking-face generation focuses mainly on lip-syncing, conditioned on audio clips.
NEUral Text to ARticulate Talk (NEUTART) is a talking-face generator that uses a joint audiovisual feature space.
The model produces photorealistic talking-face videos with human-like articulation and well-synchronized audiovisual streams.
arXiv Detail & Related papers (2023-12-11T18:41:55Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital-human industry but has not yet been technically achieved.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which designs a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- Responsive Listening Head Generation: A Benchmark Dataset and Baseline [58.168958284290156]
We define the responsive listening head generation task as the synthesis of a non-verbal head with motions and expressions reacting to multiple inputs.
Unlike speech-driven gesture or talking head generation, we introduce more modalities into this task, hoping to benefit several research fields.
arXiv Detail & Related papers (2021-12-27T07:18:50Z)
- Intelligent Conversational Android ERICA Applied to Attentive Listening and Job Interview [41.789773897391605]
We have developed an intelligent conversational android ERICA.
We set up several social interaction tasks for ERICA, including attentive listening, job interview, and speed dating.
It has been evaluated with 40 senior participants, who engaged in conversations of 5-7 minutes without a conversation breakdown.
arXiv Detail & Related papers (2021-05-02T06:37:23Z)
- Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge [48.905496060794114]
We describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge.
We adopt dot-product attention to combine text and non-text features of the input video (a minimal sketch of this fusion follows this entry).
Our systems achieve high performance on automatic metrics and obtain 5th and 6th place in human evaluation.
arXiv Detail & Related papers (2020-02-25T06:41:07Z)
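The DSTC8 entry above mentions dot-product attention for combining text with non-text video features. Below is a self-contained Python/NumPy sketch of that mechanism: text features act as queries attending over video features as keys/values. The shapes, the scaling by sqrt(d), and the function name are illustrative, following the standard Transformer formulation rather than that paper's exact configuration.

    # Illustrative scaled dot-product attention fusing text queries with
    # video (non-text) keys/values; standard formulation, not the paper's
    # exact setup.
    import numpy as np

    def dot_product_attention(text_q: np.ndarray,
                              video_kv: np.ndarray) -> np.ndarray:
        """text_q: (T_text, d); video_kv: (T_video, d). Returns (T_text, d)."""
        d = text_q.shape[-1]
        scores = text_q @ video_kv.T / np.sqrt(d)       # (T_text, T_video)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ video_kv                       # video-attended text features

    rng = np.random.default_rng(0)
    fused = dot_product_attention(rng.normal(size=(8, 64)),
                                  rng.normal(size=(20, 64)))
    print(fused.shape)  # (8, 64)

Each output row is a convex combination of video feature vectors, weighted by how strongly the corresponding text position attends to each video frame.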