Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis
- URL: http://arxiv.org/abs/2411.13209v1
- Date: Wed, 20 Nov 2024 11:18:05 GMT
- Title: Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis
- Authors: Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen
- Abstract summary: We propose and implement a fully integrated system that replaces conventional AFE models with OpenAI's Whisper.
We show that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions.
- Score: 3.210706100833053
- Abstract: This paper examines the integration of real-time talking-head generation for interviewer training, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI's Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. These advancements make the system a more effective tool for immersive, interactive training applications, expanding the potential of AI-driven avatars in interviewer training.
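As a rough illustration of the AFE swap the abstract describes, the sketch below runs only Whisper's encoder as an audio feature extractor via Hugging Face's `transformers`; the checkpoint size (`whisper-tiny`) and the 30 s padded window are assumptions, not details taken from the paper.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Keep only Whisper's encoder; the decoder is not needed for feature extraction.
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
encoder = WhisperModel.from_pretrained("openai/whisper-tiny").get_encoder()
encoder.eval()

def extract_audio_features(waveform_16k):
    """Map a 16 kHz mono waveform to Whisper encoder hidden states."""
    inputs = feature_extractor(waveform_16k, sampling_rate=16000,
                               return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(inputs.input_features).last_hidden_state
    return hidden  # (1, 1500, 384) for whisper-tiny's padded 30 s window
```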
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms prior to the actual EOU.
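A minimal sketch of what such acoustic-linguistic cross-attention could look like; the layer sizes and the regression head are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EOUPredictor(nn.Module):
    """Hypothetical head: linguistic token states attend to acoustic
    frames, then the fused state regresses the time left until EOU."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.to_seconds = nn.Linear(d_model, 1)

    def forward(self, text_states, audio_states):
        # text_states: (B, T_text, d); audio_states: (B, T_audio, d)
        fused, _ = self.cross_attn(text_states, audio_states, audio_states)
        return self.to_seconds(fused[:, -1])  # predict from the latest token
```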
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Enabling Real-Time Conversations with Minimal Training Costs [61.80370154101649]
This paper presents a new duplex decoding approach that enhances large language models with duplex ability, requiring minimal training.
Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
arXiv Detail & Related papers (2024-09-18T06:27:26Z)
- Synesthesia of Machines (SoM)-Enhanced ISAC Precoding for Vehicular Networks with Double Dynamics [15.847713094328286]
Integrated sensing and communication (ISAC) technology plays a crucial role in vehicular networks.
Double dynamics present significant challenges for real-time ISAC precoding design.
We propose a Synesthesia of Machines (SoM)-enhanced precoding paradigm.
arXiv Detail & Related papers (2024-08-24T10:35:10Z)
- Heterogeneous Space Fusion and Dual-Dimension Attention: A New Paradigm for Speech Enhancement [7.789114492151524]
We introduce a novel speech enhancement framework, HFSDA, which integrates heterogeneous spatial features and incorporates a dual-dimension attention mechanism.
Our model excels at capturing both high-level semantic information and detailed spectral data, enabling a more thorough analysis and refinement of speech signals.
We refine the Conformer model by enhancing its feature extraction capabilities not only in the temporal dimension but also across the spectral domain.
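One plausible reading of a dual-dimension attention mechanism is self-attention applied along both axes of a spectrogram; the sketch below is an assumption-laden illustration, not the HFSDA design.

```python
import torch
import torch.nn as nn

class DualDimensionAttention(nn.Module):
    """Sketch: attention over time frames, then over frequency bins."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, F, d) spectrogram features with embedding dim d
        B, T, F, d = x.shape
        t = x.permute(0, 2, 1, 3).reshape(B * F, T, d)  # attend across time
        t, _ = self.time_attn(t, t, t)
        x = t.reshape(B, F, T, d).permute(0, 2, 1, 3)
        f = x.reshape(B * T, F, d)                      # attend across frequency
        f, _ = self.freq_attn(f, f, f)
        return f.reshape(B, T, F, d)
```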
arXiv Detail & Related papers (2024-08-13T14:04:24Z)
- RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network [48.95833484103569]
RealTalk consists of an audio-to-expression transformer and a high-fidelity expression-to-face framework.
In the first component, we consider both identity and intra-personal variation features related to speaking lip movements.
In the second component, we design a lightweight facial identity alignment (FIA) module.
This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules.
arXiv Detail & Related papers (2024-06-26T12:09:59Z)
- Thread Detection and Response Generation using Transformers with Prompt Optimisation [5.335657953493376]
This paper develops an end-to-end model that identifies threads and prioritises their response generation based on their importance.
The model achieves up to 10x speed improvement, while generating more coherent results compared to existing models.
arXiv Detail & Related papers (2024-03-09T14:50:20Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
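A minimal sketch of one audio-guided fusion layer, assuming the common pattern where audio frames act as queries over visual (lip) features; dimensions and the feed-forward block are illustrative, not the CMFE specification.

```python
import torch
import torch.nn as nn

class AudioGuidedFusionLayer(nn.Module):
    """Sketch: audio queries attend to visual keys/values, then FFN."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, audio, visual):
        # audio: (B, T_a, d); visual: (B, T_v, d)
        attn, _ = self.cross_attn(audio, visual, visual)
        x = self.norm1(audio + attn)
        return self.norm2(x + self.ffn(x))
```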
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference [6.279057784373124]
In this paper, we design a fully multimodal video-to-emotion system (FV2ES) for fast yet effective recognition inference.
Applying a hierarchical attention method to the sound spectra overcomes the limited contribution of the acoustic modality.
Integrating data pre-processing into the aligned multimodal learning model further yields a significant reduction in computational cost and storage space.
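One way to read "hierarchical attention upon the sound spectra" is self-attention at several temporal scales with the coarse outputs merged back; the sketch below is a guess at that pattern, with scales and sizes chosen arbitrarily.

```python
import torch
import torch.nn as nn

class HierarchicalSpectralAttention(nn.Module):
    """Sketch: self-attention over spectral frames at multiple scales."""
    def __init__(self, d_model=128, n_heads=4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in scales)
        self.merge = nn.Linear(len(scales) * d_model, d_model)

    def forward(self, x):
        # x: (B, T, d) frame-level spectral embeddings
        outs = []
        for scale, attn in zip(self.scales, self.attns):
            h = x if scale == 1 else nn.functional.avg_pool1d(
                x.transpose(1, 2), scale, scale).transpose(1, 2)
            h, _ = attn(h, h, h)
            if scale > 1:  # stretch coarse output back to the frame rate
                h = nn.functional.interpolate(
                    h.transpose(1, 2), size=x.size(1)).transpose(1, 2)
            outs.append(h)
        return self.merge(torch.cat(outs, dim=-1))
```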
arXiv Detail & Related papers (2022-09-21T08:05:26Z)
- VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction [50.986371459817256]
We propose a novel Virtual InteRacTion mechanism, termed VIRT, to enable full and deep interaction modeling in representation-based models.
VIRT asks representation-based encoders to conduct virtual interactions to mimic the behaviors as interaction-based models do.
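A speculative sketch of how such interaction mimicry could be trained: attention computed between two independently encoded sequences (the "virtual interaction") is pushed toward the cross-attention map of an interaction-based teacher. The loss form and the precomputed teacher map are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def virtual_interaction_loss(h_query, h_doc, teacher_attn):
    """h_query: (B, T_q, d); h_doc: (B, T_d, d);
    teacher_attn: (B, T_q, T_d) attention probabilities from a
    full interaction-based (cross-encoder) teacher."""
    scores = torch.matmul(h_query, h_doc.transpose(1, 2))
    scores = scores / h_query.size(-1) ** 0.5
    student_log_attn = F.log_softmax(scores, dim=-1)
    # Align the dual encoders' virtual interaction with the teacher's.
    return F.kl_div(student_log_attn, teacher_attn, reduction="batchmean")
```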
arXiv Detail & Related papers (2021-12-08T09:49:28Z)
- Towards Intelligibility-Oriented Audio-Visual Speech Enhancement [8.19144665585397]
We present a fully convolutional audio-visual speech enhancement (AV SE) model that uses a modified short-time objective intelligibility (STOI) metric as its training cost function.
Our proposed intelligibility-oriented (I-O) AV SE framework outperforms audio-only (AO) and AV models trained with conventional distance-based loss functions.
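As a crude, differentiable stand-in for an intelligibility-oriented objective, one can correlate short-time magnitude-spectrogram segments of the clean and enhanced signals; real STOI adds octave-band filtering and clipping, which this hedged sketch omits.

```python
import torch

def stoi_like_loss(clean, enhanced, n_fft=512, hop=256, seg=30):
    """Segment-wise spectral correlation loss (simplified STOI analogue).
    clean, enhanced: (B, L) waveforms of at least seg STFT frames."""
    win = torch.hann_window(n_fft, device=clean.device)
    C = torch.stft(clean, n_fft, hop, window=win, return_complex=True).abs()
    E = torch.stft(enhanced, n_fft, hop, window=win, return_complex=True).abs()
    corrs = []
    for t in range(0, C.size(-1) - seg + 1, seg):
        c = C[..., t:t + seg] - C[..., t:t + seg].mean(-1, keepdim=True)
        e = E[..., t:t + seg] - E[..., t:t + seg].mean(-1, keepdim=True)
        corrs.append((c * e).sum(-1) /
                     (c.norm(dim=-1) * e.norm(dim=-1) + 1e-8))
    return 1.0 - torch.stack(corrs).mean()  # minimize to raise correlation
```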
arXiv Detail & Related papers (2021-11-18T11:47:37Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
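SpecAugment-style masking is one plausible form of spectrogram augmentation for SER; the snippet below uses torchaudio's built-in transforms, with mask sizes chosen arbitrarily rather than taken from the paper.

```python
import torch
import torchaudio.transforms as T

# Randomly mask frequency bands and time spans of a mel spectrogram.
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),  # mask up to 15 mel bins
    T.TimeMasking(time_mask_param=35),       # mask up to 35 frames
)

mel = T.MelSpectrogram(sample_rate=16000, n_mels=64)
waveform = torch.randn(1, 16000)             # stand-in 1 s utterance
augmented = augment(mel(waveform))           # (1, 64, time), masked
```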
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.