Time2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation
- URL: http://arxiv.org/abs/2602.09648v1
- Date: Tue, 10 Feb 2026 10:55:25 GMT
- Title: Time2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation
- Authors: Siyu Chen, Ting Han, Haoling Huang, Chaolei Wang, Chengzheng Fu, Duxin Zhu, Guorong Cai, Jinhe Su,
- Abstract summary: Domain Generalized Video Semantic Segmentation (DGVSS) is trained on a single labeled driving domain. Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGVSS and VSS baselines.
- Score: 9.929390581043334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Domain Generalized Video Semantic Segmentation (DGVSS) models are trained on a single labeled driving domain and deployed directly on unseen domains, without target labels or test-time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, we propose a Masked Temporal Consistency Loss that regularizes temporal prediction discrepancies across different strides, and we randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
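Since the paper's code is not yet released, the following is only a minimal PyTorch sketch of the idea behind the Masked Temporal Consistency Loss: predictions for the same frame, decoded from clips sampled at two different strides, are pulled together only in regions that look label-stable. The function names, the agreement-based stability mask, the symmetric-KL discrepancy, and the stride range are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a masked multi-stride temporal consistency loss in
# PyTorch. Names, the stability mask, the symmetric KL discrepancy, and the
# stride range are assumptions, not details from the Time2General paper.
import torch


def masked_temporal_consistency_loss(logits_a, logits_b, conf_thresh=0.9):
    """Penalize prediction discrepancies for the same frame decoded from clips
    sampled at two different temporal strides, restricted to regions that look
    label-stable.

    logits_a, logits_b: (B, C, H, W) per-frame segmentation logits obtained
    under two different sampling strides.
    """
    prob_a = logits_a.softmax(dim=1)
    prob_b = logits_b.softmax(dim=1)

    conf_a, pred_a = prob_a.max(dim=1)
    conf_b, pred_b = prob_b.max(dim=1)

    # Assumed stability mask: both views are confident and agree on the class.
    stable = (conf_a > conf_thresh) & (conf_b > conf_thresh) & (pred_a == pred_b)

    # Symmetric KL between the two per-pixel class distributions.
    log_a = prob_a.clamp_min(1e-8).log()
    log_b = prob_b.clamp_min(1e-8).log()
    kl_ab = (prob_a * (log_a - log_b)).sum(dim=1)   # KL(a || b), shape (B, H, W)
    kl_ba = (prob_b * (log_b - log_a)).sum(dim=1)   # KL(b || a)
    per_pixel = 0.5 * (kl_ab + kl_ba)

    # Average only over masked (stable) pixels.
    return (per_pixel * stable).sum() / stable.sum().clamp_min(1)


def sample_stride(max_stride=5):
    """Randomized training stride so the model sees diverse temporal gaps;
    the range [1, max_stride] is an assumed choice."""
    return int(torch.randint(1, max_stride + 1, (1,)))
```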
Related papers
- MEMTS: Internalizing Domain Knowledge via Parameterized Memory for Retrieval-Free Domain Adaptation of Time Series Foundation Models [51.506429027626005]
Memory for Time Series (MEMTS) is a lightweight and plug-and-play method for retrieval-free domain adaptation in time series forecasting. A key component of MEMTS is a Knowledge Persistence Module (KPM), which internalizes domain-specific temporal dynamics. This paradigm shift enables MEMTS to achieve accurate domain adaptation with constant-time inference and near-zero latency.
arXiv Detail & Related papers (2026-02-14T14:00:06Z) - A Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness [8.202209362704494]
We propose a Dual-Branch Framework for Semantic Change Detection with Boundary and Temporal Awareness, termed ANet. ANet integrates global semantics, local details, temporal reasoning, and boundary awareness, achieving state-of-the-art performance.
arXiv Detail & Related papers (2026-02-12T00:54:22Z) - E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching [87.38371267983263]
Temporal Video Grounding (TVG) aims to precisely localize time segments corresponding to query events. E.M.Ground is a novel Vid-LLM for TVG that focuses on holistic and coherent event perception. E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
arXiv Detail & Related papers (2026-02-05T02:16:00Z) - Morphing Through Time: Diffusion-Based Bridging of Temporal Gaps for Robust Alignment in Change Detection [51.56484100374058]
We introduce a modular pipeline that improves spatial and temporal robustness without altering existing change detection networks. A diffusion module synthesizes intermediate morphing frames that bridge large appearance gaps, enabling RoMa to estimate stepwise correspondences. Experiments on LEVIR-CD, WHU-CD, and DSIFN-CD show consistent gains in both registration accuracy and downstream change detection.
arXiv Detail & Related papers (2025-11-11T08:40:28Z) - TimeMosaic: Temporal Heterogeneity Guided Time Series Forecasting via Adaptive Granularity Patch and Segment-wise Decoding [3.64798801374117]
TimeMosaic is a forecasting framework that aims to address temporal heterogeneity. TimeMosaic employs adaptive patch embedding to dynamically adjust granularity according to local information density. Our model, trained on a large-scale corpus of 321 billion observations, achieves performance competitive with state-of-the-art TSFMs.
arXiv Detail & Related papers (2025-09-23T09:20:00Z) - Improving Weakly Supervised Temporal Action Localization by Exploiting Multi-resolution Information in Temporal Domain [84.73693644211596]
We propose a two-stage approach to fully exploit multi-resolution information in the temporal domain. In the first stage, we generate reliable initial frame-level pseudo labels based on both appearance and motion streams. In the second stage, we iteratively refine the pseudo labels and use a set of selected frames with highly confident pseudo labels to train neural networks.
arXiv Detail & Related papers (2025-06-23T03:20:18Z) - Temporally Consistent Referring Video Object Segmentation with Hybrid Memory [98.80249255577304]
We propose an end-to-end R-VOS paradigm that explicitly models temporal consistency alongside the referring segmentation.
Features of frames with automatically generated high-quality reference masks are propagated to segment remaining frames.
Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin.
arXiv Detail & Related papers (2024-03-28T13:32:49Z) - Local-Global Temporal Difference Learning for Satellite Video Super-Resolution [53.03380679343968]
We propose to exploit the well-defined temporal difference for efficient and effective temporal compensation. To fully utilize the local and global temporal information within frames, we systematically model the short-term and long-term temporal discrepancies. Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches.
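As a rough illustration of the temporal-difference cue this paper builds on, the sketch below computes short-term (adjacent-frame) and long-term (larger-gap) differences over a clip; the tensor layout and the gap size are assumptions, not the paper's settings.

```python
# Rough illustration of short-term vs. long-term temporal differences over a
# clip. The (T, C, H, W) layout and the gap size are assumed, not taken from
# the paper.
import torch


def temporal_differences(frames: torch.Tensor, long_gap: int = 4):
    """frames: (T, C, H, W) consecutive video frames.

    Returns adjacent-frame differences (short-term) and differences across a
    gap of `long_gap` frames (long-term), which a compensation module could
    consume as explicit motion cues.
    """
    short_term = frames[1:] - frames[:-1]                 # (T-1, C, H, W)
    long_term = frames[long_gap:] - frames[:-long_gap]    # (T-long_gap, C, H, W)
    return short_term, long_term


# Example: an 8-frame clip of 3-channel 64x64 frames.
clip = torch.randn(8, 3, 64, 64)
short, long_ = temporal_differences(clip)
```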
arXiv Detail & Related papers (2023-04-10T07:04:40Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Domain Adaptive Video Segmentation via Temporal Consistency Regularization [32.77436219094282]
This paper presents DA-VSN, a domain adaptive video segmentation network that addresses domain gaps in videos by temporal consistency regularization (TCR).
The first is cross-domain TCR that guides the prediction of target frames to have similar temporal consistency as that of source frames (learnt from annotated source data) via adversarial learning.
The second is intra-domain TCR that guides unconfident predictions of target frames to have similar temporal consistency as confident predictions of target frames.
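DA-VSN implements both TCR terms with adversarial learning; the simplified, non-adversarial sketch below only illustrates the quantity being aligned: a per-pixel temporal discrepancy map on target frames, split into confident and unconfident regions. The confidence threshold and the omission of flow-based warping are assumptions.

```python
# Simplified, non-adversarial sketch of the signal DA-VSN regularizes: a
# per-pixel temporal discrepancy map on target frames, split into confident
# and unconfident regions. The actual method aligns these with adversarial
# learning and warps frame t-1 to frame t with optical flow; both are omitted
# here, and the confidence threshold is an assumption.
import torch


def temporal_discrepancy(prob_t, prob_tm1):
    """Per-pixel prediction discrepancy between consecutive target frames.

    prob_t, prob_tm1: (B, C, H, W) softmax outputs for frames t and t-1.
    """
    return (prob_t - prob_tm1).abs().sum(dim=1)  # (B, H, W)


def split_by_confidence(prob_t, discrepancy, conf_thresh=0.8):
    """Split the discrepancy map into confident vs. unconfident pixels, i.e.
    the two groups whose temporal behaviour intra-domain TCR aligns."""
    conf, _ = prob_t.max(dim=1)
    return discrepancy[conf >= conf_thresh], discrepancy[conf < conf_thresh]
```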
arXiv Detail & Related papers (2021-07-23T02:50:42Z)