Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis
- URL: http://arxiv.org/abs/2601.13659v1
- Date: Tue, 20 Jan 2026 06:50:40 GMT
- Title: Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis
- Authors: Chunlei Meng, Ziyang Zhou, Lucas He, Xiaojing Du, Chun Ouyang, Zhongxue Gan
- Abstract summary: We propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial bodies. Factor-Consistent Cross-Modal Alignment aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts.
- Score: 9.998823710345919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Sentiment Analysis integrates linguistic, visual, and acoustic modalities. Mainstream approaches based on modality-invariant and modality-specific factorization or on complex fusion still rely on spatiotemporal mixed modeling. This ignores spatiotemporal heterogeneity, leading to spatiotemporal information asymmetry and thus limited performance. Hence, we propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder project signals into separate temporal and spatial bodies. Factor-Consistent Cross-Modal Alignment then aligns temporal features only with their temporal counterparts across modalities, and spatial features only with their spatial counterparts. Factor-specific supervision and decorrelation regularization reduce cross-factor leakage while preserving complementarity. A Gated Recouple module subsequently recouples the aligned streams for the task. Extensive experiments show that TSDA outperforms baselines. Ablation studies confirm the necessity and interpretability of the design.
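The three mechanisms named in the abstract (factor-consistent alignment, decorrelation regularization, and gated recoupling) can be illustrated with a minimal sketch. The abstract does not give the actual losses or gate, so everything here is an assumption chosen for illustration: a cosine-distance alignment term, a squared Pearson correlation penalty, and an elementwise sigmoid gate. All function names, parameters, and the feature layout are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def factor_alignment_loss(feats, factor):
    """Assumed alignment term: mean (1 - cosine) over modality pairs,
    comparing only features of the SAME factor ("temporal" or "spatial")."""
    pairs = list(combinations(sorted(feats), 2))
    return sum(1.0 - cosine(feats[m][factor], feats[n][factor]) for m, n in pairs) / len(pairs)

def pearson(a, b):
    """Pearson correlation of two sequences; 0.0 if either is constant."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return 0.0 if va == 0 or vb == 0 else cov / math.sqrt(va * vb)

def decorrelation_penalty(temporal_cols, spatial_cols):
    """Assumed regularizer: mean squared correlation between every temporal
    feature column and every spatial feature column (columns run over a batch)."""
    total = sum(pearson(t, s) ** 2 for t in temporal_cols for s in spatial_cols)
    return total / (len(temporal_cols) * len(spatial_cols))

def gated_recouple(t, s, w_t, w_s, bias):
    """Assumed gate: g = sigmoid(w_t*t + w_s*s + bias) per dimension,
    output g*t + (1 - g)*s."""
    out = []
    for ti, si, wt, ws, b in zip(t, s, w_t, w_s, bias):
        g = 1.0 / (1.0 + math.exp(-(wt * ti + ws * si + b)))
        out.append(g * ti + (1.0 - g) * si)
    return out

# Toy features for two modalities (made-up numbers, for illustration only).
feats = {
    "text":  {"temporal": [1.0, 0.0], "spatial": [0.0, 1.0]},
    "audio": {"temporal": [2.0, 0.0], "spatial": [0.0, -1.0]},
}
temporal_loss = factor_alignment_loss(feats, "temporal")  # temporal vs. temporal only
spatial_loss = factor_alignment_loss(feats, "spatial")    # spatial vs. spatial only
fused = gated_recouple([1.0], [3.0], [0.0], [0.0], [0.0])
```

In this toy setup the temporal vectors point the same way across modalities, so their alignment term vanishes, while the opposed spatial vectors are penalized. With zero gate weights the gate sits at 0.5 and the recoupled output is the plain average of the two streams.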
Related papers
- Parallel Complex Diffusion for Scalable Time Series Generation [50.01609741902786]
PaCoDi is a spectral-native architecture that decouples generative modeling in the frequency domain.
We show that PaCoDi outperforms existing baselines in both generation quality and inference speed.
arXiv Detail & Related papers (2026-02-10T14:31:53Z)
- Unleashing Temporal Capacity of Spiking Neural Networks through Spatiotemporal Separation [67.69345363409835]
Spiking Neural Networks (SNNs) are considered naturally suited for temporal processing, with membrane potential propagation widely regarded as the core temporal modeling mechanism.
We design Non-Stateful (NS) models that progressively remove membrane propagation to quantify its stage-wise role.
Experiments reveal a counterintuitive phenomenon: moderate removal in shallow layers improves performance, while excessive removal causes collapse.
arXiv Detail & Related papers (2025-12-05T07:05:53Z)
- TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection [54.22717266034045]
TaCo is a consistent semantic network for temporal semantic transitions.
We show that TaCo consistently achieves state-of-the-art performance on remote sensing change detection tasks.
This design can yield substantial gains without any additional computational overhead during inference.
arXiv Detail & Related papers (2025-11-25T13:44:29Z)
- Multivariate Long-term Time Series Forecasting with Fourier Neural Filter [42.60778405812048]
We introduce FNF as the backbone and DBD as architecture to provide excellent learning capabilities and optimal learning pathways for spatial-temporal modeling.
We show that FNF unifies local time-domain and global frequency-domain information processing within a single backbone that extends naturally to spatial modeling.
arXiv Detail & Related papers (2025-06-10T18:40:20Z)
- Reduced Spatial Dependency for More General Video-level Deepfake Detection [9.51656628987442]
We propose a novel method called Spatial Dependency Reduction (SDR), which integrates common temporal consistency features from multiple spatially-perturbed clusters.
Extensive benchmarks and ablation studies demonstrate the effectiveness and rationale of our approach.
arXiv Detail & Related papers (2025-03-05T08:51:55Z)
- Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting [16.782154479264126]
Predicting spatio-temporal traffic flow presents challenges due to complex interactions between spatial and temporal factors.
Existing approaches address these dimensions in isolation, neglecting their critical interdependencies.
In this paper, we introduce the Adaptive Spatio-Temporal Unitized Cell (ASTUC), a unified framework designed to capture both spatial and temporal dependencies.
arXiv Detail & Related papers (2024-11-14T07:34:31Z)
- SFTformer: A Spatial-Frequency-Temporal Correlation-Decoupling Transformer for Radar Echo Extrapolation [15.56594998349013]
The spatial morphology and temporal evolution of radar echoes exhibit a certain degree of correlation, yet they also possess independent characteristics.
To effectively model the dynamics of radar echoes, we propose a Spatial-Frequency-Temporal correlation-decoupling Transformer (SFTformer).
Experimental results on the HKO-7 and ChinaNorth-2021 datasets demonstrate the superior performance of SFTformer in short (1h), mid (2h), and long-term (3h) precipitation nowcasting.
arXiv Detail & Related papers (2024-02-28T04:43:41Z)
- A Decoupled Spatio-Temporal Framework for Skeleton-based Action Segmentation [89.86345494602642]
Existing methods are limited by weak spatio-temporal modeling capability.
We propose a Decoupled Spatio-Temporal Framework (DeST) to address these issues.
DeST significantly outperforms current state-of-the-art methods with less computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z)
- Spatio-temporal Diffusion Point Processes [23.74522530140201]
A spatio-temporal point process (STPP) is a collection of events accompanied by time and space.
The failure to model the joint distribution leads to limited capacity in characterizing the spatio-temporal interactions given past events.
We propose a novel parameterization framework, which learns complex spatial-temporal joint distributions.
Our framework outperforms the state-of-the-art baselines remarkably, with an average improvement over 50%.
arXiv Detail & Related papers (2023-05-21T08:53:00Z)
- Supporting Optimal Phase Space Reconstructions Using Neural Network Architecture for Time Series Modeling [68.8204255655161]
We propose an artificial neural network with a mechanism to implicitly learn the properties of phase spaces.
Our approach is either as competitive as or better than most state-of-the-art strategies.
arXiv Detail & Related papers (2020-06-19T21:04:47Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.