FlexLip: A Controllable Text-to-Lip System
- URL: http://arxiv.org/abs/2206.03206v1
- Date: Tue, 7 Jun 2022 11:51:58 GMT
- Title: FlexLip: A Controllable Text-to-Lip System
- Authors: Dan Oneata, Beata Lorincz, Adriana Stan and Horia Cucu
- Abstract summary: We tackle a sub-issue of the text-to-video generation problem by converting the text into lip landmarks.
Our system, entitled FlexLip, is split into two separate modules: text-to-speech and speech-to-lip.
We show that by using as little as 20 min of data for the audio generation component, and as little as 5 min for the speech-to-lip component, the objective measures of the generated lip landmarks are comparable with those obtained when using a larger set of training samples.
- Score: 6.15560473113783
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of converting text input into video content is becoming an important
topic for synthetic media generation. Several methods have been proposed with
some of them reaching close-to-natural performances in constrained tasks. In
this paper, we tackle a sub-issue of the text-to-video generation problem by
converting the text into lip landmarks. However, we do this using a modular,
controllable system architecture and evaluate each of its individual
components. Our system, entitled FlexLip, is split into two separate modules:
text-to-speech and speech-to-lip, both having underlying controllable deep
neural network architectures. This modularity enables easy replacement of
each of its components, while also ensuring fast adaptation to new speaker
identities by disentangling or projecting the input features. We show that by
using as little as 20 min of data for the audio generation component, and as
little as 5 min for the speech-to-lip component, the objective measures of the
generated lip landmarks are comparable with those obtained when using a larger
set of training samples. We also introduce a series of objective evaluation
measures over the complete flow of our system by taking into consideration
several aspects of the data and system configuration. These aspects pertain to
the quality and amount of training data, the use of pretrained models, and the
data contained therein, as well as the identity of the target speaker; with
regard to the latter, we show that we can perform zero-shot lip adaptation to
an unseen identity by simply updating the shape of the lips in our model.
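The abstract describes two swappable stages (text-to-speech, then speech-to-lip) plus a zero-shot speaker adaptation step that only updates the lip shape. Below is a minimal, hypothetical sketch of that flow; the class and function names (TextToSpeech, SpeechToLip, zero_shot_lip_adaptation) and all array shapes are illustrative placeholders, not the paper's actual API.
```python
# Minimal sketch of a FlexLip-style modular pipeline, assuming only NumPy.
# All names and shapes are hypothetical placeholders, not the paper's code.
import numpy as np


class TextToSpeech:
    """Stand-in for the controllable text-to-speech module."""

    def synthesize(self, text: str) -> np.ndarray:
        # Would return acoustic features (e.g. a mel-spectrogram);
        # a dummy 80-bin x 100-frame array keeps the example runnable.
        return np.zeros((80, 100), dtype=np.float32)


class SpeechToLip:
    """Stand-in for the speech-to-lip module predicting 2D lip landmarks."""

    def predict(self, acoustic_features: np.ndarray) -> np.ndarray:
        # One set of 20 (x, y) lip landmarks per acoustic frame.
        num_frames = acoustic_features.shape[1]
        return np.zeros((num_frames, 20, 2), dtype=np.float32)


def zero_shot_lip_adaptation(landmarks: np.ndarray,
                             source_mean_shape: np.ndarray,
                             target_mean_shape: np.ndarray) -> np.ndarray:
    """Retarget predicted landmarks to an unseen speaker by swapping the
    mean lip shape, mirroring the zero-shot adaptation idea in the abstract."""
    return landmarks - source_mean_shape + target_mean_shape


# Modular usage: either stage can be replaced independently.
tts, speech_to_lip = TextToSpeech(), SpeechToLip()
acoustic = tts.synthesize("hello world")
lips = speech_to_lip.predict(acoustic)

# Zero-shot retargeting: only the unseen speaker's mean lip shape is needed.
source_shape = lips.mean(axis=0)    # mean shape of the seen speaker
target_shape = source_shape * 1.1   # stand-in for an unseen speaker's shape
adapted_lips = zero_shot_lip_adaptation(lips, source_shape, target_shape)
```
The point of the sketch is the interface boundary: because the stages only exchange feature arrays, either module can be replaced or fine-tuned on a small amount of speaker-specific data without touching the other, which is the modularity the abstract emphasizes.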
Related papers
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - Leveraging Generative Language Models for Weakly Supervised Sentence
Component Analysis in Video-Language Joint Learning [10.486585276898472]
A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks.
We postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models.
We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks.
arXiv Detail & Related papers (2023-12-10T02:03:51Z) - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
Adaptation [89.96013329530484]
We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes.
We utilize an existing text-conditioned video generation model and a pre-trained audio encoder model.
We validate our method extensively on three datasets demonstrating significant semantic diversity of audio-video samples.
arXiv Detail & Related papers (2023-09-28T13:26:26Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - An analysis on the effects of speaker embedding choice in non
auto-regressive TTS [4.619541348328938]
We introduce a first attempt on understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets.
We show that, regardless of the used set of embeddings and learning strategy, the network can handle various speaker identities equally well.
arXiv Detail & Related papers (2023-07-19T10:57:54Z) - ASiT: Local-Global Audio Spectrogram vIsion Transformer for Event
Classification [42.95038619688867]
ASiT is a novel self-supervised learning framework that captures local and global contextual information by employing group masked model learning and self-distillation.
We evaluate our pretrained models on both audio and speech classification tasks, including audio event classification, keyword spotting, and speaker identification.
arXiv Detail & Related papers (2022-11-23T18:21:09Z) - Modeling Motion with Multi-Modal Features for Text-Based Video
Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From
Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language
Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z) - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal
Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)