ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
- URL: http://arxiv.org/abs/2512.19546v1
- Date: Mon, 22 Dec 2025 16:28:27 GMT
- Title: ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
- Authors: Ziqiao Peng, Yi Chen, Yifeng Ma, Guozhen Zhang, Zhiyao Sun, Zixiang Zhou, Youliang Zhang, Zhengguang Zhou, Zhaoxin Fan, Hongyan Liu, Yuan Zhou, Qinglin Lu, Jun He
- Abstract summary: ActAvatar is a framework that achieves phase-level precision in action control through textual guidance. Phase-Aware Cross-Attention (PACA) decomposes prompts into a global base block and temporally-anchored phase blocks. Progressive Audio-Visual Alignment aligns modality influence with the hierarchical feature learning process. A two-stage training strategy injects action control through fine-tuning on structured annotations.
- Score: 28.337100940626573
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite significant advances in talking avatar generation, existing methods face critical challenges: insufficient text-following capability for diverse actions, lack of temporal alignment between actions and audio content, and dependency on additional control signals such as pose skeletons. We present ActAvatar, a framework that achieves phase-level precision in action control through textual guidance by capturing both action semantics and temporal context. Our approach introduces three core innovations: (1) Phase-Aware Cross-Attention (PACA), which decomposes prompts into a global base block and temporally-anchored phase blocks, enabling the model to concentrate on phase-relevant tokens for precise temporal-semantic alignment; (2) Progressive Audio-Visual Alignment, which aligns modality influence with the hierarchical feature learning process: early layers prioritize text for establishing action structure, while deeper layers emphasize audio for refining lip movements, preventing modality interference; (3) a two-stage training strategy that first establishes robust audio-visual correspondence on diverse data, then injects action control through fine-tuning on structured annotations, maintaining both audio-visual alignment and the model's text-following capabilities. Extensive experiments demonstrate that ActAvatar significantly outperforms state-of-the-art methods in both action control and visual quality.
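The PACA mechanism is concrete enough to sketch. Below is a minimal, hypothetical rendering of the phase-masked cross-attention idea in PyTorch; the tensor names, shapes, and the absence of learned key/value projections are our own assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def phase_aware_cross_attention(frame_feats, base_tokens, phase_tokens,
                                phase_spans, frame_times):
    """frame_feats: (T, D) frame queries; base_tokens: (Nb, D) global base
    block; phase_tokens: list of (Ni, D) phase blocks; phase_spans: list of
    (start, end) times; frame_times: (T,) per-frame timestamps."""
    keys = torch.cat([base_tokens] + list(phase_tokens), dim=0)  # (N, D)
    T, N = frame_feats.shape[0], keys.shape[0]

    # The global base block is visible to every frame.
    visible = torch.zeros(T, N, dtype=torch.bool)
    visible[:, : base_tokens.shape[0]] = True

    # Each temporally-anchored phase block is visible only to frames
    # whose timestamps fall inside that phase's span.
    offset = base_tokens.shape[0]
    for tokens, (start, end) in zip(phase_tokens, phase_spans):
        in_span = (frame_times >= start) & (frame_times < end)   # (T,)
        visible[in_span, offset : offset + tokens.shape[0]] = True
        offset += tokens.shape[0]

    # Standard scaled dot-product cross-attention under the phase mask.
    scores = frame_feats @ keys.t() / keys.shape[-1] ** 0.5      # (T, N)
    scores = scores.masked_fill(~visible, float("-inf"))
    return F.softmax(scores, dim=-1) @ keys                      # (T, D)
```

Frames outside every phase span still attend to the base block, so global prompt semantics are never lost; phase-relevant tokens are added only for frames inside a span.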
Related papers
- Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization [66.80402022104074]
Point-supervised Temporal Action Localization (PTAL) adopts a lightly frame-annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to locate action instances within untrimmed videos. Most existing approaches design the task head with only point-trimmed snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. We propose a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization.
arXiv Detail & Related papers (2026-02-05T14:46:21Z)
- 3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control [3.606473077857744]
3DGesPolicy is an action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module.
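As a toy illustration of treating frame-to-frame variations as actions, the sketch below integrates a predicted delta-pose trajectory from an initial pose; the dimensions and names are illustrative assumptions, not 3DGesPolicy's actual parameterization.

```python
import torch

def rollout(initial_pose, deltas):
    """initial_pose: (J,) flattened holistic pose; deltas: (T, J) actions,
    one pose delta per frame."""
    poses = [initial_pose]
    for delta in deltas:              # integrate frame-to-frame variations
        poses.append(poses[-1] + delta)
    return torch.stack(poses)         # (T + 1, J) gesture trajectory

pose0 = torch.zeros(165)                  # e.g. body + hands + face parameters
actions = 0.01 * torch.randn(120, 165)    # 120 predicted holistic actions
trajectory = rollout(pose0, actions)
```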
arXiv Detail & Related papers (2026-01-26T12:57:36Z)
- CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation [44.84719308595376]
CoordSpeaker is a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions.
arXiv Detail & Related papers (2025-11-28T03:38:08Z)
- Zero-Shot Open-Vocabulary Human Motion Grounding with Test-Time Training [39.7658823121591]
ZOMG is a framework that segments motion sequences into semantically meaningful sub-actions without requiring any annotations or fine-tuning. ZOMG integrates (1) language semantic partition, which leverages large language models to decompose instructions into ordered sub-action units, and (2) soft masking optimization. Experiments on three motion-language datasets demonstrate state-of-the-art motion grounding effectiveness and efficiency, outperforming prior methods by +8.7% mAP on the HumanML3D benchmark.
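The soft masking step lends itself to a short test-time-optimization sketch: learn per-frame soft assignments so that each mask-weighted motion segment matches its sub-action text embedding. The shapes, the cosine objective, and the lack of temporal-ordering constraints here are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

T, K, D = 200, 4, 256                             # frames, sub-actions, dim
motion = F.normalize(torch.randn(T, D), dim=-1)   # per-frame motion features
text = F.normalize(torch.randn(K, D), dim=-1)     # sub-action text embeddings

logits = torch.zeros(T, K, requires_grad=True)    # soft masks to optimize
opt = torch.optim.Adam([logits], lr=0.05)
for _ in range(100):                              # test-time training loop
    masks = logits.softmax(dim=1)                 # (T, K), rows sum to 1
    seg = F.normalize(masks.t() @ motion, dim=-1) # mask-weighted segment feats
    loss = -(seg * text).sum(dim=-1).mean()       # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

assignment = logits.softmax(dim=1).argmax(dim=1)  # frame -> sub-action index
```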
arXiv Detail & Related papers (2025-11-19T12:11:36Z)
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
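A phonetic-visual bridge of this kind typically collapses phonemes into viseme classes that drive the renderer. The toy mapping below is a simplified assumption, not Text2Lip's actual inventory or pipeline.

```python
# Many-to-one phoneme -> viseme table (ARPAbet-style symbols, abridged).
PHONEME_TO_VISEME = {
    "P": "bilabial", "B": "bilabial", "M": "bilabial",
    "F": "labiodental", "V": "labiodental",
    "AA": "open", "AE": "open",
    "IY": "spread", "EH": "spread",
    "UW": "rounded", "OW": "rounded",
}

def phonemes_to_visemes(phonemes):
    # Phonemes outside the table fall back to a neutral mouth shape.
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

print(phonemes_to_visemes(["HH", "AH", "L", "OW"]))  # "hello" -> mouth shapes
```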
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
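The observer loop can be written schematically. Every callable below is a placeholder for a component the paper defines; this is a sketch of the control flow, not PPAD's real API.

```python
def ppad_sampling(latent, prompt, steps, denoise_step, decode_preview,
                  mllm_critique, feedback_to_guidance, check_every=5):
    """Denoise while periodically consulting an MLLM semantic observer."""
    guidance = None
    for t in reversed(range(steps)):
        latent = denoise_step(latent, t, prompt, guidance)
        if t % check_every == 0 and t > 0:
            preview = decode_preview(latent)               # intermediate image
            feedback = mllm_critique(preview, prompt)      # semantic observer
            if feedback:                                   # inconsistency found
                guidance = feedback_to_guidance(feedback)  # steer later steps
    return latent
```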
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
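The core projection step can be sketched in a few lines; the iterative refinement and the audio-visual semantic similarity loss are omitted, and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

S, C, D = 10, 25, 512                            # segments, event classes, dim
segs = torch.randn(S, D)                         # encoded audio/visual segments
labels = F.normalize(torch.randn(C, D), dim=-1)  # independent label embeddings

evidence = segs @ labels.t()   # (S, C) affinity of each segment to each label
probs = evidence.sigmoid()     # multi-label per-segment predictions
projected = probs @ labels     # segments re-expressed in label-embedding space
```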
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
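A minimal sketch of the motion-aware masking idea, assuming per-frame features and a frame-level masking unit (MASA's actual masking granularity and ratio may differ): frames with the largest inter-frame change are preferentially hidden, so reconstruction must model motion cues.

```python
import torch

def motion_aware_mask(frames, mask_ratio=0.75):
    """frames: (T, D) per-frame features; returns a bool mask, True = hidden."""
    motion = (frames[1:] - frames[:-1]).norm(dim=-1)  # (T-1,) motion magnitude
    motion = torch.cat([motion[:1], motion])          # pad to length T
    n_mask = int(mask_ratio * frames.shape[0])
    mask = torch.zeros(frames.shape[0], dtype=torch.bool)
    mask[motion.topk(n_mask).indices] = True          # hide high-motion frames
    return mask
```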
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- Co-Speech Gesture Detection through Multi-Phase Sequence Labeling [3.924524252255593]
We introduce a novel framework that reframes the task as a multi-phase sequence labeling problem.
We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues.
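To make the reframing concrete: instead of a binary gesture/no-gesture flag, each frame receives a gesture-phase tag. The Kendon-style phase set and the decoding helper below are illustrative assumptions, not the paper's exact scheme.

```python
PHASES = ["neutral", "preparation", "stroke", "retraction"]

# A plausible per-frame labeling for one gesture (indices into PHASES):
frame_labels = [0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 0, 0]

def extract_gestures(labels):
    """Collapse a phase sequence into (start, end) spans of whole gestures."""
    spans, start = [], None
    for i, lab in enumerate(labels + [0]):   # sentinel flushes a trailing span
        if lab != 0 and start is None:
            start = i                        # gesture begins at first non-neutral
        elif lab == 0 and start is not None:
            spans.append((start, i))
            start = None
    return spans

print(extract_gestures(frame_labels))        # [(2, 11)]
```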
arXiv Detail & Related papers (2023-08-21T12:27:18Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single-stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and a pseudo-labeled key word prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Emphasis control for parallel neural TTS [8.039245267912511]
The semantic information conveyed by a speech signal is strongly influenced by local variations in prosody.
Recent parallel neural text-to-speech (TTS) methods are able to generate speech with high fidelity while maintaining high performance.
This paper proposes a hierarchical parallel neural TTS system for prosodic emphasis control by learning a latent space that directly corresponds to a change in emphasis.
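If the latent space directly corresponds to emphasis, inference-time control reduces to shifting that latent for the target positions before decoding. The sketch below assumes hypothetical `encode`/`decode` components and is not a real TTS API.

```python
import torch

def synthesize_with_emphasis(encode, decode, text, emphasize, strength=1.5):
    """emphasize: (L,) boolean mask over encoder positions to stress."""
    phoneme_feats, emphasis_latent = encode(text)   # (L, D), (L, E) latents
    emphasis_latent = emphasis_latent.clone()
    emphasis_latent[emphasize] *= strength          # push along the emphasis axis
    return decode(phoneme_feats, emphasis_latent)   # synthesized speech
```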
arXiv Detail & Related papers (2021-10-06T18:45:39Z)