Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation
- URL: http://arxiv.org/abs/2504.05746v1
- Date: Tue, 08 Apr 2025 07:23:28 GMT
- Title: Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation
- Authors: Zhihua Xu, Tianshui Chen, Zhijing Yang, Siyuan Peng, Keze Wang, Liang Lin
- Abstract summary: Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames. We learn audio-visual correlations and integrate the correlations to help enhance feature representation and regularize final generation.
- Score: 62.218932509432314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paramount challenge in audio-driven One-shot Talking Head Animation (ADOS-THA) lies in capturing subtle, imperceptible changes between adjacent video frames. Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames, offering supplementary information that can be pivotal for guiding and supervising talking head animations. In this work, we propose to learn audio-visual correlations and integrate them to enhance feature representation and regularize final generation via a novel Temporal Audio-Visual Correlation Embedding (TAVCE) framework. Specifically, it first learns an audio-visual temporal correlation metric, ensuring the temporal audio relationships of adjacent clips are aligned with the temporal visual relationships of the corresponding adjacent video frames. Since the temporal audio relationship contains aligned information about the visual frames, we integrate it to guide the learning of more representative features via a simple yet effective channel attention mechanism. During training, we also use the alignment correlations as an additional objective to supervise the generation of visual frames. We conduct extensive experiments on several publicly available benchmarks (i.e., HDTF, LRW, VoxCeleb1, and VoxCeleb2) to demonstrate its superiority over existing leading algorithms.
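The abstract names three mechanisms: a temporal correlation metric that aligns adjacent-clip audio relationships with adjacent-frame visual relationships, a channel attention module that injects the audio relation into the visual features, and the alignment score reused as a training objective. The paper's actual architecture is not reproduced here; the PyTorch sketch below is a minimal illustration of that reading, and every module name, dimension, and loss weight is an assumption rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalCorrelationAlignment(nn.Module):
    """Sketch of a temporal audio-visual correlation embedding.

    Adjacent-step differences of audio and visual embeddings are
    projected into a shared space; a cosine loss encourages the
    audio temporal relation to match the visual one. All dimensions
    are illustrative, not taken from the paper.
    """

    def __init__(self, audio_dim=256, visual_dim=512, embed_dim=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.visual_proj = nn.Linear(visual_dim, embed_dim)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T, audio_dim); visual_feats: (B, T, visual_dim)
        # Temporal relation = difference between adjacent time steps.
        a_rel = self.audio_proj(audio_feats[:, 1:] - audio_feats[:, :-1])
        v_rel = self.visual_proj(visual_feats[:, 1:] - visual_feats[:, :-1])
        # Alignment loss: pull audio and visual relations together.
        align_loss = 1.0 - F.cosine_similarity(a_rel, v_rel, dim=-1).mean()
        return a_rel, align_loss


class AudioRelationChannelAttention(nn.Module):
    """Channel attention driven by the audio temporal relation (sketch)."""

    def __init__(self, embed_dim=128, channels=512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(embed_dim, channels),
            nn.Sigmoid(),
        )

    def forward(self, frame_feats, a_rel):
        # frame_feats: (B, C, H, W); a_rel: (B, embed_dim) for one step.
        weights = self.gate(a_rel).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return frame_feats * weights  # reweight channels by audio relation
```

During training, the alignment term would be added to the usual reconstruction and adversarial objectives, e.g. `total = recon_loss + lambda_align * align_loss`, matching the abstract's description of the correlation as an extra supervisory signal; the weight `lambda_align` is likewise hypothetical.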
Related papers
- Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation [39.38821481268827]
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects that accurately align with the corresponding audio. Current methods focus more on object-level information but neglect the boundaries of audio semantic changes, leading to temporal misalignment. We propose a Collaborative Hybrid Propagator Framework (Co-Prop) to address this issue.
arXiv Detail & Related papers (2024-12-11T07:33:18Z)
- Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding [33.85362137961572]
We introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations.
PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering.
We develop AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens.
arXiv Detail & Related papers (2024-03-24T19:50:49Z)
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This bidirectional interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
- STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment [61.83340833859382]
Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks.
This is a nontrivial problem and poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs and multimodal correlation overwriting that forgets audio-video relations.
We propose a continual audio-video pre-training method with two novel ideas.
arXiv Detail & Related papers (2023-10-12T10:50:21Z)
- Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models [25.660343393359565]
This paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal large language models (LLMs).
FAVOR simultaneously perceives speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level.
An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.
arXiv Detail & Related papers (2023-10-09T17:00:20Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- Temporal and cross-modal attention for audio-visual zero-shot learning [38.02396786726476]
Generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information.
We propose a multi-modal and Temporal Cross-attention Framework (TCaF) for audio-visual generalised zero-shot learning.
We show that our proposed framework, which ingests temporal features, yields state-of-the-art performance on the UCF, VGGSound, and ActivityNet benchmarks for (generalised) zero-shot learning.
arXiv Detail & Related papers (2022-07-20T15:19:30Z)
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning a graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance; a toy graph-construction sketch follows this entry.
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
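The SPELL summary above describes a graph whose nodes are per-person detections, connected spatially within a frame and temporally along each person's track. SPELL's exact construction is not given here, so the Python snippet below only illustrates the general idea; the `detections` input format and the `time_window` parameter are assumptions, not the authors' design.

```python
import networkx as nx


def build_spatial_temporal_graph(detections, time_window=2):
    """Toy spatial-temporal graph for active speaker detection.

    `detections` is assumed to be a list of dicts like
    {"frame": int, "person": str, "feat": list} -- a hypothetical
    input format, not the one used by SPELL.
    """
    g = nx.Graph()
    for i, det in enumerate(detections):
        g.add_node(i, frame=det["frame"], person=det["person"], feat=det["feat"])
    for i, a in enumerate(detections):
        for j in range(i + 1, len(detections)):
            b = detections[j]
            same_frame = a["frame"] == b["frame"]
            temporal = (a["person"] == b["person"]
                        and 0 < abs(a["frame"] - b["frame"]) <= time_window)
            if same_frame or temporal:
                g.add_edge(i, j)
    return g
```

Same-frame edges capture spatial context between speakers, while short-range temporal edges let evidence propagate along each person's track.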
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information and neglect the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Learning Audio-Visual Correlations from Variational Cross-Modal Generation [35.07257471319274]
We learn the audio-visual correlations from the perspective of cross-modal generation in a self-supervised manner.
The learned correlations can be readily applied in multiple downstream tasks such as audio-visual cross-modal localization and retrieval.
arXiv Detail & Related papers (2021-02-05T21:27:00Z)