TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection
- URL: http://arxiv.org/abs/2511.20306v1
- Date: Tue, 25 Nov 2025 13:44:29 GMT
- Title: TaCo: Capturing Spatio-Temporal Semantic Consistency in Remote Sensing Change Detection
- Authors: Han Guo, Chenyang Liu, Haotian Zhang, Bowen Chen, Zhengxia Zou, Zhenwei Shi,
- Abstract summary: TaCo is a spatio-temporal semantic consistent network that models change as a semantic transition between bi-temporal states. We show that TaCo consistently achieves SOTA performance on remote sensing change detection tasks. This design yields substantial gains without any additional computational overhead during inference.
- Score: 54.22717266034045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote sensing change detection (RSCD) aims to identify surface changes across bi-temporal satellite images. Most previous methods rely solely on mask supervision, which effectively guides spatial localization but provides limited constraints on the temporal semantic transitions. Consequently, they often produce spatially coherent predictions while still suffering from unresolved semantic inconsistencies. To address this limitation, we propose TaCo, a spatio-temporal semantic consistent network, which enriches the existing mask-supervised framework with a spatio-temporal semantic joint constraint. TaCo conceptualizes change as a semantic transition between bi-temporal states, in which one temporal feature representation can be derived from the other via dedicated transition features. To realize this, we introduce a Text-guided Transition Generator that integrates textual semantics with bi-temporal visual features to construct the cross-temporal transition features. In addition, we propose a spatio-temporal semantic joint constraint consisting of bi-temporal reconstruction constraints and a transition constraint: the former enforces alignment between reconstructed and original features, while the latter enhances discrimination for changes. This design can yield substantial performance gains without introducing any additional computational overhead during inference. Extensive experiments on six public datasets, spanning both binary and semantic change detection tasks, demonstrate that TaCo consistently achieves SOTA performance.
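The joint constraint described in the abstract can be sketched as follows. The additive form of the transition operator, the hinge-style transition term, the margin value, and all tensor shapes are illustrative assumptions for this sketch, not the paper's actual formulation.

```python
import numpy as np

def reconstruction_loss(f_t1, f_t2, transition):
    """Bi-temporal reconstruction constraint: each temporal feature should be
    recoverable from the other via the transition (additive form assumed here)."""
    f_t2_hat = f_t1 + transition
    f_t1_hat = f_t2 - transition
    return np.mean((f_t2_hat - f_t2) ** 2) + np.mean((f_t1_hat - f_t1) ** 2)

def transition_loss(transition, change_mask, margin=1.0):
    """Transition constraint: encourage large transition magnitude in changed
    pixels and small magnitude in unchanged ones (hinge-style, illustrative)."""
    mag = np.linalg.norm(transition, axis=0)               # (H, W) per-pixel norm
    changed = np.maximum(0.0, margin - mag) * change_mask  # push magnitude up
    unchanged = mag * (1.0 - change_mask)                  # push magnitude down
    return changed.mean() + unchanged.mean()

# Toy bi-temporal features: C=4 channels over an 8x8 spatial grid.
rng = np.random.default_rng(0)
f1 = rng.normal(size=(4, 8, 8))
t = rng.normal(size=(4, 8, 8))
f2 = f1 + t                                   # a perfect transition by construction
mask = (rng.random((8, 8)) > 0.5).astype(float)

joint = reconstruction_loss(f1, f2, t) + transition_loss(t, mask)
```

Because the joint constraint is a training-time loss over intermediate features, it adds no operations to the inference path, which is consistent with the abstract's claim of zero extra inference overhead.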
Related papers
- A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness [8.202209362704494]
We propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed ANet. ANet integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.
arXiv Detail & Related papers (2026-02-12T00:54:22Z) - Temporal-Spatial Decouple before Act: Disentangled Representation Learning for Multimodal Sentiment Analysis [9.998823710345919]
We propose TSDA, Temporal-Spatial Decouple before Act, which explicitly decouples each modality into temporal dynamics and spatial structural context before any interaction. For every modality, a temporal encoder and a spatial encoder disentangle the signal into separate temporal and spatial representations. Factor-Consistent Cross-Modal Alignment aligns temporal features only with their temporal-specific counterparts across modalities, and spatial features only with their spatial counterparts.
arXiv Detail & Related papers (2026-01-20T06:50:40Z) - Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [131.33758144860988]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z) - Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction [61.484280369655536]
Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning. We introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP).
arXiv Detail & Related papers (2024-12-11T09:53:10Z) - A Late-Stage Bitemporal Feature Fusion Network for Semantic Change Detection [32.112311027857636]
We propose a novel late-stage bitemporal feature fusion network for semantic change detection.
Specifically, we propose a local-global attentional aggregation module to strengthen feature fusion, and a local-global context enhancement module to highlight pivotal semantics.
Our proposed model achieves new state-of-the-art performance on both datasets.
arXiv Detail & Related papers (2024-06-15T16:02:10Z) - Unified Domain Adaptive Semantic Segmentation [105.05235403072021]
Unsupervised Domain Adaptive Semantic Segmentation (UDA-SS) aims to transfer the supervision from a labeled source domain to an unlabeled target domain. We propose a Quad-directional Mixup (QuadMix) method, characterized by tackling distinct point attributes and feature inconsistencies. Our method outperforms the state-of-the-art works by large margins on four challenging UDA-SS benchmarks.
arXiv Detail & Related papers (2023-11-22T09:18:49Z) - Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z) - Joint Spatio-Temporal Modeling for the Semantic Change Detection in
Remote Sensing Images [22.72105435238235]
We propose a Semantic Change Transformer (SCanFormer) to explicitly model the 'from-to' semantic transitions between the bi-temporal RSIs.
Then, we introduce a semantic learning scheme to leverage the spatio-temporal constraints, which are coherent to the SCD task, to guide the learning of semantic changes.
The resulting network (SCanNet) outperforms the baseline method in terms of both detection of critical semantic changes and semantic consistency in the obtained bi-temporal results.
arXiv Detail & Related papers (2022-12-10T08:49:19Z) - Spatiotemporal Multi-scale Bilateral Motion Network for Gait Recognition [3.1240043488226967]
In this paper, motivated by optical flow, the bilateral motion-oriented features are proposed.
We develop a set of multi-scale temporal representations that force the motion context to be richly described at various levels of temporal resolution.
arXiv Detail & Related papers (2022-09-26T01:36:22Z) - Bi-Temporal Semantic Reasoning for the Semantic Change Detection of HR
Remote Sensing Images [17.53683781109742]
We propose a novel CNN architecture for semantic change detection (SCD).
We elaborate on this architecture to model the bi-temporal semantic correlations.
The resulting Bi-temporal Semantic Reasoning Network (Bi-SRNet) contains two types of semantic reasoning blocks to reason both single-temporal and cross-temporal semantic correlations.
arXiv Detail & Related papers (2021-08-13T07:28:09Z) - Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
arXiv Detail & Related papers (2021-06-30T15:42:08Z) - Temporal Context Aggregation Network for Temporal Action Proposal
Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.