STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
- URL: http://arxiv.org/abs/2503.23039v1
- Date: Sat, 29 Mar 2025 11:04:10 GMT
- Title: STSA: Spatial-Temporal Semantic Alignment for Visual Dubbing
- Authors: Zijun Ding, Mingdie Xiong, Congcong Zhu, Jingrun Chen,
- Abstract summary: We argue that aligning the semantic features from spatial and temporal domains is a promising approach to stabilizing facial motion. We propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation.
- Score: 2.231167375820083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing audio-driven visual dubbing methods have achieved great success. Despite this, we observe that the semantic ambiguity between spatial and temporal domains significantly degrades synthesis stability for dynamic faces. We argue that aligning the semantic features from spatial and temporal domains is a promising approach to stabilizing facial motion. To achieve this, we propose a Spatial-Temporal Semantic Alignment (STSA) method, which introduces a dual-path alignment mechanism and a differentiable semantic representation. The former leverages a Consistent Information Learning (CIL) module to maximize the mutual information at multiple scales, thereby reducing the manifold differences between the spatial and temporal domains. The latter utilizes probabilistic heatmaps as ambiguity-tolerant guidance to avoid the abnormal dynamics of the synthesized faces caused by slight semantic jittering. Extensive experimental results demonstrate the superiority of the proposed STSA, especially in terms of image quality and synthesis stability. Pre-trained weights and inference code are available at https://github.com/SCAILab-USTC/STSA.
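The abstract names two concrete mechanisms: a CIL module that maximizes mutual information between spatial and temporal features at multiple scales, and probabilistic heatmaps used as ambiguity-tolerant guidance. The PyTorch sketch below illustrates one plausible reading of each; it is not the authors' implementation (the linked repository holds that). The function names, feature shapes, and hyper-parameters here are assumptions, and InfoNCE is assumed as the mutual-information lower bound since the abstract does not specify an estimator.

```python
# Minimal sketch of the two mechanisms described in the abstract.
# Everything here (names, shapes, hyper-parameters) is illustrative;
# the authors' actual code is at https://github.com/SCAILab-USTC/STSA.
import torch
import torch.nn.functional as F

def info_nce(spatial_feats, temporal_feats, temperature=0.07):
    """InfoNCE loss, a standard lower bound on mutual information.
    Inputs: (batch, dim) projections of spatial-path and temporal-path
    features at one scale; matching rows are positive pairs."""
    s = F.normalize(spatial_feats, dim=-1)
    t = F.normalize(temporal_feats, dim=-1)
    logits = s @ t.T / temperature                     # (batch, batch)
    labels = torch.arange(s.size(0), device=s.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)

def multi_scale_alignment_loss(spatial_pyramid, temporal_pyramid):
    """Sum the bound over feature scales, mirroring the description of
    CIL as maximizing mutual information "at multiple scales"."""
    return sum(info_nce(s.flatten(1), t.flatten(1))
               for s, t in zip(spatial_pyramid, temporal_pyramid))

def gaussian_heatmaps(landmarks, size=64, sigma=2.0):
    """Render 2D landmarks in [0, 1]^2 as Gaussian heatmaps.
    landmarks: (batch, n_points, 2). A slightly jittered landmark only
    shifts probability mass a little, so the guidance degrades gracefully
    rather than flipping discretely -- one reading of the
    "ambiguity-tolerant" property the abstract attributes to heatmaps."""
    b, n, _ = landmarks.shape
    ys = torch.arange(size, dtype=torch.float32,
                      device=landmarks.device).view(1, 1, size, 1)
    xs = torch.arange(size, dtype=torch.float32,
                      device=landmarks.device).view(1, 1, 1, size)
    cy = landmarks[..., 1].view(b, n, 1, 1) * (size - 1)
    cx = landmarks[..., 0].view(b, n, 1, 1) * (size - 1)
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```

Given feature pyramids from the two paths, e.g. `multi_scale_alignment_loss([s1, s2, s3], [t1, t2, t3])` yields a single alignment term to add to the training loss, while the heatmaps could be concatenated with image features as soft spatial conditioning.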
Related papers
- From Observations to States: Latent Time Series Forecasting [65.98504021691666]
We propose Latent Time Series Forecasting (LatentTSF), a novel paradigm that shifts TSF from observation regression to latent state prediction. Specifically, LatentTSF employs an AutoEncoder to project observations at each time step into a higher-dimensional latent state space. Our proposed latent objectives implicitly maximize mutual information between the predicted latent states and the ground-truth states and observations.
arXiv Detail & Related papers (2026-01-30T20:39:44Z) - Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis [9.998823710345919]
We propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder separate the signal into distinct temporal and spatial components. Factor-Consistent Cross-Modal Alignment aligns temporal features only with their temporal-specific counterparts across modalities, and spatial features only with their spatial counterparts.
arXiv Detail & Related papers (2026-01-20T06:50:40Z) - DETACH: Decomposed Spatio-Temporal Alignment for Exocentric Video and Ambient Sensors with Staged Learning [7.149401911329968]
Aligning egocentric video with wearable sensors has shown promise for human action recognition, but faces practical limitations in user discomfort, privacy concerns, and scalability. We explore exocentric video with ambient sensors as a non-intrusive, scalable alternative. Comprehensive experiments with downstream tasks on the Opportunity++ and Hambi-USPWU datasets demonstrate substantial improvements over adapted egocentric-wearable baselines.
arXiv Detail & Related papers (2025-12-23T14:55:53Z) - TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection [54.22717266034045]
TaCo is a semantically consistent network for modeling temporal semantic transitions. We show that TaCo consistently achieves SOTA performance on remote sensing change detection tasks. This design yields substantial gains without any additional computational overhead during inference.
arXiv Detail & Related papers (2025-11-25T13:44:29Z) - Semantic Concentration for Self-Supervised Dense Representations Learning [103.10708947415092]
Image-level self-supervised learning (SSL) has made significant progress, yet learning dense representations for patches remains challenging. This work reveals that image-level SSL avoids over-dispersion by involving implicit semantic concentration.
arXiv Detail & Related papers (2025-09-11T13:12:10Z) - OptiCorNet: Optimizing Sequence-Based Context Correlation for Visual Place Recognition [2.3093110834423616]
This paper presents OptiCorNet, a novel sequence modeling framework. It unifies spatial feature extraction and temporal differencing into a differentiable, end-to-end trainable module. Our approach outperforms state-of-the-art baselines under challenging seasonal and viewpoint variations.
arXiv Detail & Related papers (2025-07-19T04:29:43Z) - Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer from a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z) - A theoretical framework for self-supervised contrastive learning for continuous dependent data [86.50780641055258]
Self-supervised learning (SSL) has emerged as a powerful approach to learning representations, particularly in the field of computer vision. We propose a novel theoretical framework for contrastive SSL tailored to semantic independence between samples. Specifically, we outperform TS2Vec on the standard UEA and UCR benchmarks, with accuracy improvements of 4.17% and 2.08%, respectively.
arXiv Detail & Related papers (2025-06-11T14:23:47Z) - STaR: Seamless Spatial-Temporal Aware Motion Retargeting with Penetration and Consistency Constraints [12.307413108334657]
We propose a novel sequence-to-sequence model for spatial-temporal motion retargeting (STaR).
STaR consists of two modules: (1) a spatial module that incorporates dense shape representation and a novel limb penetration constraint to ensure geometric plausibility while preserving motion semantics, and (2) a temporal module that utilizes a temporal transformer and a temporal consistency constraint to predict the entire motion sequence at once while enforcing multi-level trajectory smoothness.
arXiv Detail & Related papers (2025-04-09T00:37:08Z) - IPSeg: Image Posterior Mitigates Semantic Drift in Class-Incremental Segmentation [77.06177202334398]
We identify two critical challenges in CISS that contribute to semantic drift and degrade performance.
First, we highlight the issue of separate optimization, where different parts of the model are optimized in distinct incremental stages.
Second, we identify noisy semantics arising from inappropriate pseudo-labeling, which leads to sub-optimal performance.
arXiv Detail & Related papers (2025-02-07T12:19:37Z) - GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling [32.47567372398872]
GestureLSM is a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. It achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods.
arXiv Detail & Related papers (2025-01-31T05:34:59Z) - Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction [61.484280369655536]
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist occupancy representation learning. We introduce Hi-SOP, a new hierarchical context alignment paradigm for more accurate SOP.
arXiv Detail & Related papers (2024-12-11T09:53:10Z) - Precise Facial Landmark Detection by Dynamic Semantic Aggregation Transformer [29.484887366344363]
Deep neural network methods have played a dominant role in the face alignment field. We propose a Dynamic Semantic-Aggregation Transformer (DSAT) for more discriminative and representative feature learning. Our proposed DSAT outperforms state-of-the-art models in the literature.
arXiv Detail & Related papers (2024-12-01T09:20:32Z) - Semantic-Guided Multimodal Sentiment Decoding with Adversarial Temporal-Invariant Learning [22.54577327204281]
Multimodal sentiment analysis aims to learn representations from different modalities to identify human emotions.
Existing works often neglect the frame-level redundancy inherent in continuous time series, resulting in incomplete modality representations with noise.
We propose temporal-invariant learning for the first time, which constrains the distributional variations over time steps to effectively capture long-term temporal dynamics.
arXiv Detail & Related papers (2024-08-30T03:28:40Z) - SSPA: Split-and-Synthesize Prompting with Gated Alignments for Multi-Label Image Recognition [71.90536979421093]
We propose a Split-and-Synthesize Prompting with Gated Alignments (SSPA) framework to amplify the potential of Vision-Language Models (VLMs).
We develop an in-context learning approach to associate the inherent knowledge from LLMs.
Then we propose a novel Split-and-Synthesize Prompting (SSP) strategy to first model the generic knowledge and downstream label semantics individually.
arXiv Detail & Related papers (2024-07-30T15:58:25Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-part working strategy: distributing language features, spatial semantic recurrent co-parsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - A Generically Contrastive Spatiotemporal Representation Enhancement for 3D Skeleton Action Recognition [10.403751563214113]
We propose a Contrastive Spatiotemporal Representation Enhancement (CSRE) framework to obtain more discriminative representations from the sequences. Specifically, we decompose the representation into spatial- and temporal-specific features to explore fine-grained motion patterns. To explicitly exploit the latent data distributions, we apply contrastive learning to the attentive features, which models the cross-sequence semantic relations.
arXiv Detail & Related papers (2023-12-23T02:54:41Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - Spatio-Temporal Self-Supervised Learning for Traffic Flow Prediction [36.77135502344546]
We propose a novel Spatio-Temporal Self-Supervised Learning (ST-SSL) traffic prediction framework.
Our ST-SSL is built over an integrated module with temporal and spatial convolutions for encoding information across space and time.
Experiments on four benchmark datasets demonstrate that ST-SSL consistently outperforms various state-of-the-art baselines.
arXiv Detail & Related papers (2022-12-07T10:02:01Z) - SLLEN: Semantic-aware Low-light Image Enhancement Network [92.80325772199876]
We develop a semantic-aware LLE network (SLLEN) composed of an LLE main-network (LLEmN) and an SS auxiliary-network (SSaN).
Unlike currently available approaches, the proposed SLLEN is able to fully leverage semantic information, e.g., IEF, HSF, and the SS dataset, to assist LLE.
Comparisons between the proposed SLLEN and other state-of-the-art techniques demonstrate the superiority of SLLEN with respect to LLE quality.
arXiv Detail & Related papers (2022-11-21T15:29:38Z)