High-Speed and High-Quality Text-to-Lip Generation
- URL: http://arxiv.org/abs/2107.06831v1
- Date: Wed, 14 Jul 2021 16:44:04 GMT
- Title: High-Speed and High-Quality Text-to-Lip Generation
- Authors: Jinglin Liu, Zhiying Zhu, Yi Ren and Zhou Zhao
- Abstract summary: We propose a novel parallel decoding model for high-speed and high-quality text-to-lip generation (HH-T2L).
We predict the duration of the encoded linguistic features and model the target lip frames conditioned on the encoded linguistic features with their duration in a non-autoregressive manner.
Experiments conducted on GRID and TCD-TIMIT datasets show that HH-T2L generates lip movements with competitive quality compared with the state-of-the-art AR T2L model DualLip.
- Score: 55.20612501355773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a key component of talking face generation, lip movements generation
determines the naturalness and coherence of the generated talking face video.
Prior literature mainly focuses on speech-to-lip generation, while text-to-lip (T2L) generation remains underexplored. T2L is a challenging task, and existing end-to-end works depend on an attention mechanism and an autoregressive (AR) decoding manner. However, AR decoding generates the current lip frame conditioned on previously generated frames, which inherently limits inference speed and also degrades the quality of the generated lip frames through error propagation. This motivates research on parallel T2L generation. In this work, we propose a novel parallel decoding model for
high-speed and high-quality text-to-lip generation (HH-T2L). Specifically, we
predict the duration of the encoded linguistic features and model the target
lip frames conditioned on the encoded linguistic features with their duration
in a non-autoregressive manner. Furthermore, we incorporate the structural similarity index (SSIM) loss and adversarial learning to improve the perceptual quality of the generated lip frames and alleviate the blurry-prediction problem. Extensive experiments on the GRID and TCD-TIMIT datasets show that 1) HH-T2L generates lip movements of competitive quality compared with the state-of-the-art AR T2L model DualLip, and exceeds the baseline AR model TransformerT2L by a notable margin thanks to the mitigation of error propagation; and 2) HH-T2L is markedly faster at inference, with an average speedup of 19$\times$ over DualLip on TCD-TIMIT.
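The parallel decoding scheme above amounts to a duration-based length regulator: each encoded linguistic feature is repeated for as many lip frames as its predicted duration, and all frames are then decoded in one pass instead of one by one. The following PyTorch sketch illustrates the idea; the names `expand_by_duration`, `hidden`, and `durations` are illustrative assumptions, not the authors' code.

```python
import torch

def expand_by_duration(hidden: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each encoded linguistic feature according to its predicted duration.

    hidden:    (T_text, d) encoder outputs for one utterance
    durations: (T_text,)   predicted integer frame count per linguistic unit
    returns:   (T_frames, d) frame-rate features, where T_frames = durations.sum()
    """
    return torch.repeat_interleave(hidden, durations, dim=0)

# Toy example: 3 encoded units expanded to 6 lip-frame positions.
hidden = torch.randn(3, 8)            # 3 linguistic units, 8-dim features
durations = torch.tensor([1, 3, 2])   # predicted per-unit frame counts
frame_feats = expand_by_duration(hidden, durations)
print(frame_feats.shape)              # torch.Size([6, 8])

# A non-autoregressive decoder can now map frame_feats to all lip frames
# in a single forward pass, e.g. frames = decoder(frame_feats).
```

Because every frame position is known up front, nothing conditions on previously generated frames, which is what removes both the sequential bottleneck and the error propagation of AR decoding.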
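The SSIM and adversarial terms mentioned in the abstract can be folded into a single generator objective. The sketch below is a minimal version under stated assumptions: it uses a simplified global SSIM (the standard metric averages over local Gaussian windows), and the weights `w_ssim` and `w_adv` are placeholder values, not the paper's.

```python
import torch
import torch.nn.functional as F

def ssim_global(x: torch.Tensor, y: torch.Tensor,
                c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Simplified global (non-windowed) SSIM for (B, C, H, W) batches in [0, 1]."""
    mu_x, mu_y = x.mean(dim=(-2, -1)), y.mean(dim=(-2, -1))
    var_x, var_y = x.var(dim=(-2, -1)), y.var(dim=(-2, -1))
    cov = ((x - mu_x[..., None, None]) * (y - mu_y[..., None, None])).mean(dim=(-2, -1))
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def generator_loss(pred: torch.Tensor, target: torch.Tensor,
                   disc_logits: torch.Tensor,
                   w_ssim: float = 1.0, w_adv: float = 0.1) -> torch.Tensor:
    """Pixel + structural + adversarial objective for the lip-frame generator."""
    recon = F.l1_loss(pred, target)                     # pixel-level fidelity
    ssim_term = 1.0 - ssim_global(pred, target).mean()  # penalize structural mismatch
    adv = F.binary_cross_entropy_with_logits(           # encourage realistic frames
        disc_logits, torch.ones_like(disc_logits))
    return recon + w_ssim * ssim_term + w_adv * adv
```

The matching discriminator update (real frames labeled 1, generated frames labeled 0) is omitted for brevity; it is the adversarial term that counteracts the blurry averages that pure pixel losses tend to produce.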
Related papers
- Retrieval, Refinement, and Ranking for Text-to-Video Generation via Prompt Optimization and Test-Time Scaling [1.6671050178877669]
Large-scale datasets have driven significant progress in Text-to-Video (T2V) generative models. Current methods for improving video output often fall short. We introduce 3R, a novel RAG-based prompt optimization framework.
arXiv Detail & Related papers (2026-03-02T06:35:59Z)
- RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling [59.088798018184235]
RAPO++ is a cross-stage prompt optimization framework. It unifies training-data-aligned refinement, test-time iterative scaling, and large language model fine-tuning. RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility.
arXiv Detail & Related papers (2025-10-23T04:45:09Z)
- SCEESR: Semantic-Control Edge Enhancement for Diffusion-Based Super-Resolution [0.8122270502556375]
Real-world image super-resolution must handle complex degradations and inherent reconstruction ambiguities. One-step diffusion models offer speed but often produce structural inaccuracies due to distillation artifacts. We propose a novel SR framework that enhances a one-step diffusion model using a ControlNet mechanism for semantic edge guidance.
arXiv Detail & Related papers (2025-10-22T06:06:01Z)
- Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers [24.722647001947923]
We propose a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. We show that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results.
arXiv Detail & Related papers (2025-10-06T08:26:55Z)
- Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective [37.58855048653859]
Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Lumos-1 retains the autoregressive video generator architecture with minimal architectural modifications.
arXiv Detail & Related papers (2025-07-11T17:59:42Z)
- MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation [16.202732894319084]
MoDiT is a novel framework that combines the 3D Morphable Model (3DMM) with a diffusion-based Transformer. Our contributions include: (i) a hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization; (ii) the integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction.
arXiv Detail & Related papers (2025-07-07T15:13:46Z)
- Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers [79.94246924019984]
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. We propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models.
arXiv Detail & Related papers (2025-06-09T17:54:04Z)
- Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation [1.3207844222875191]
Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image and text-to-video generation and editing. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance.
arXiv Detail & Related papers (2025-05-31T00:52:17Z)
- GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture [12.303324248639266]
We propose a text-to-speech generation approach optimized via a novel dual-branch ArchiTecture (GOAT-TTS).
GOAT-TTS combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency.
Experimental results demonstrate that our GOAT-TTS achieves performance comparable to state-of-the-art TTS models.
arXiv Detail & Related papers (2025-04-15T01:44:56Z)
- Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis [64.12708207721276]
We introduce a novel pseudo-autoregressive (PAR) language modeling approach that unifies AR and NAR modeling.
Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement.
Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data.
arXiv Detail & Related papers (2025-04-14T16:03:21Z)
- Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion [26.706957163997043]
We propose a framework that integrates temporal-spatial and semantic consistency with bilateral DDIM inversion.
Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset.
arXiv Detail & Related papers (2025-01-08T16:41:31Z)
- ConsisSR: Delving Deep into Consistency in Diffusion-based Image Super-Resolution [28.945663118445037]
Real-world image super-resolution (Real-ISR) aims at restoring high-quality (HQ) images from low-quality (LQ) inputs corrupted by unknown and complex degradations.
We introduce ConsisSR to handle both semantic and pixel-level consistency.
arXiv Detail & Related papers (2024-10-17T17:41:52Z)
- MDT-A2G: Exploring Masked Diffusion Transformers for Co-Speech Gesture Generation [44.74056930805525]
We introduce a novel Masked Diffusion Transformer for co-speech gesture generation, referred to as MDT-A2G.
This model employs a mask modeling scheme specifically designed to strengthen temporal relation learning among sequence gestures.
Experimental results demonstrate that MDT-A2G excels in gesture generation, boasting a learning speed that is over 6$\times$ faster than traditional diffusion transformers.
arXiv Detail & Related papers (2024-08-06T17:29:01Z)
- Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT [120.39362661689333]
We present an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency.
Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities.
arXiv Detail & Related papers (2024-06-05T17:53:26Z)
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, increasing the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z)
- GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation [71.73912454164834]
A modern talking face generation method is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency.
NeRF has become a popular technique in this field since it can achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video.
We propose GeneFace++ to handle these challenges by utilizing the rendering pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process.
arXiv Detail & Related papers (2023-05-01T12:24:09Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97$\times$ compared with state-of-the-art lipreading models.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)