LASER: Lip Landmark Assisted Speaker Detection for Robustness
- URL: http://arxiv.org/abs/2501.11899v1
- Date: Tue, 21 Jan 2025 05:29:34 GMT
- Title: LASER: Lip Landmark Assisted Speaker Detection for Robustness
- Authors: Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee
- Abstract summary: We propose Lip landmark Assisted Speaker dEtection for Robustness (LASER).
LASER aims to identify speaking individuals in complex visual scenes by matching lip movements to audio.
Experiments show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals.
- Score: 30.82311863795508
- Abstract: Active Speaker Detection (ASD) aims to identify speaking individuals in complex visual scenes. While humans can easily detect speech by matching lip movements to audio, current ASD models struggle to establish this correspondence, often misclassifying non-speaking instances when audio and lip movements are unsynchronized. To address this limitation, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER). Unlike models that rely solely on facial frames, LASER explicitly focuses on lip movements by integrating lip landmarks in training. Specifically, given a face track, LASER extracts frame-level visual features and the 2D coordinates of lip landmarks using a lightweight detector. These coordinates are encoded into dense feature maps, providing spatial and structural information on lip positions. Recognizing that landmark detectors may sometimes fail under challenging conditions (e.g., low resolution, occlusions, extreme angles), we incorporate an auxiliary consistency loss to align predictions from both lip-aware and face-only features, ensuring reliable performance even when lip data is absent. Extensive experiments across multiple datasets show that LASER outperforms state-of-the-art models, especially in scenarios with desynchronized audio and visuals, demonstrating robust performance in real-world video contexts. Code is available at https://github.com/plnguyen2908/LASER_ASD.
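The abstract describes two concrete mechanisms: per-frame 2D lip-landmark coordinates are encoded into dense feature maps that are fused with the face features, and an auxiliary consistency loss aligns predictions from the lip-aware and face-only branches so the model stays reliable when the landmark detector fails. Below is a minimal PyTorch sketch of these two ideas; the tensor shapes, the Gaussian rasterization of landmarks, and the two-branch heads are illustrative assumptions rather than the authors' released implementation (see the repository linked above for that).

```python
# Minimal sketch of lip-landmark feature maps and the lip-aware/face-only
# consistency loss described in the LASER abstract. All shapes, layer sizes,
# and the Gaussian rasterization are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


def landmarks_to_maps(landmarks, map_size=32, sigma=1.5):
    """Rasterize 2D lip landmarks (B, T, K, 2), normalized to [0, 1],
    into dense per-frame heatmaps of shape (B, T, K, map_size, map_size)."""
    B, T, K, _ = landmarks.shape
    ys = torch.linspace(0, 1, map_size, device=landmarks.device)
    xs = torch.linspace(0, 1, map_size, device=landmarks.device)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")   # (H, W)
    lx = landmarks[..., 0].reshape(B, T, K, 1, 1)
    ly = landmarks[..., 1].reshape(B, T, K, 1, 1)
    d2 = (grid_x - lx) ** 2 + (grid_y - ly) ** 2
    return torch.exp(-d2 / (2 * (sigma / map_size) ** 2))    # Gaussian bump per landmark


class TwoBranchASD(nn.Module):
    """Face-only branch plus a lip-aware branch that consumes landmark maps."""

    def __init__(self, feat_dim=512, num_landmarks=20):
        super().__init__()
        self.landmark_encoder = nn.Sequential(
            nn.Conv2d(num_landmarks, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.face_head = nn.Linear(feat_dim, 2)        # speaking / not speaking
        self.lip_head = nn.Linear(feat_dim * 2, 2)

    def forward(self, face_feats, landmarks):
        # face_feats: (B, T, feat_dim); landmarks: (B, T, K, 2)
        B, T, D = face_feats.shape
        maps = landmarks_to_maps(landmarks)                              # (B, T, K, H, W)
        lm_feats = self.landmark_encoder(maps.flatten(0, 1)).view(B, T, D)
        logits_face = self.face_head(face_feats)
        logits_lip = self.lip_head(torch.cat([face_feats, lm_feats], dim=-1))
        return logits_face, logits_lip


def consistency_loss(logits_face, logits_lip):
    """Auxiliary loss aligning face-only predictions with lip-aware ones,
    so performance degrades gracefully when lip landmarks are unavailable."""
    return F.kl_div(
        F.log_softmax(logits_face, dim=-1),
        F.softmax(logits_lip.detach(), dim=-1),
        reduction="batchmean",
    )
```

The direction of the consistency term (here the face-only branch is pulled toward the detached lip-aware branch) and the loss weighting are design guesses not specified in the abstract; the repository above is the authoritative reference.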
Related papers
- SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion [78.77211425667542]
SayAnything is a conditional video diffusion framework that directly synthesizes lip movements from audio input.
Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation.
arXiv Detail & Related papers (2025-02-17T07:29:36Z)
- LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition [12.336693356113308]
We propose a novel framework, LipGen, to improve model robustness.
We introduce an auxiliary task that incorporates viseme classification alongside attention mechanisms.
Our method demonstrates superior performance compared to the current state-of-the-art on the lip reading in the wild (LRW) dataset.
arXiv Detail & Related papers (2025-01-08T00:52:19Z)
- MuseTalk: Real-Time High Quality Lip Synchronization with Latent Space Inpainting [12.852715177163608]
MuseTalk generates lip-sync targets in a latent space encoded by a Variational Autoencoder.
It supports online generation of 256x256 face video at more than 30 FPS with negligible starting latency.
arXiv Detail & Related papers (2024-10-14T03:22:26Z)
- LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition [46.438575751932866]
LipGER is a framework for leveraging visual cues for noise-robust ASR.
We show that LipGER improves the Word Error Rate in the range of 1.1%-49.2%.
We also release LipHyp, a large-scale dataset with hypothesis-transcription pairs equipped with lip motion cues.
arXiv Detail & Related papers (2024-06-06T18:17:59Z)
- Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM).
VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation.
We show that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements.
arXiv Detail & Related papers (2024-02-23T07:21:32Z)
- Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video [91.92782707888618]
We present a decomposition-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance.
We show that our model can be trained by a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization.
arXiv Detail & Related papers (2023-09-09T14:52:39Z)
- Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation [58.72068260933836]
We propose the Context-Aware LipSync framework (CALS).
CALS comprises an Audio-to-Lip map module and a Lip-to-Face module.
arXiv Detail & Related papers (2023-05-31T04:50:32Z)
- Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert [89.07178484337865]
Talking face generation, also known as speech-to-lip generation, reconstructs the facial motions of the lip region from coherent speech input.
Previous studies revealed the importance of lip-speech synchronization and visual quality.
We propose using a lip-reading expert to improve the intelligibility of the generated lip regions.
arXiv Detail & Related papers (2023-03-29T07:51:07Z)
- SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory [27.255990661166614]
The challenge of talking face generation from speech lies in aligning two different modalities, audio and video, such that the mouth region corresponds to the input audio.
Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models.
We propose Audio-Lip Memory that brings in visual information of the mouth region corresponding to input audio and enforces fine-grained audio-visual coherence.
arXiv Detail & Related papers (2022-11-02T07:17:49Z)
- Sub-word Level Lip Reading With Visual Attention [88.89348882036512]
We focus on the unique challenges encountered in lip reading and propose tailored solutions.
We obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets.
Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models.
arXiv Detail & Related papers (2021-10-14T17:59:57Z)
- Three-Dimensional Lip Motion Network for Text-Independent Speaker Recognition [24.433021731098474]
Lip motion reflects behavioral characteristics of speakers and can be used as a new kind of biometric in speaker recognition.
We present a novel end-to-end 3D lip motion Network (3LMNet) by utilizing the sentence-level 3D lip motion.
A new regional feedback module (RFM) is proposed to obtain attentions in different lip regions.
arXiv Detail & Related papers (2020-10-13T13:18:33Z)