Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation
- URL: http://arxiv.org/abs/2412.08034v1
- Date: Wed, 11 Dec 2024 02:29:51 GMT
- Title: Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation
- Authors: Zhigang Cen, Ningyan Guo, Wenjing Xu, Zhiyong Feng, Danlan Huang,
- Abstract summary: Video semantic segmentation (VSS) has been widely employed in lots of fields, such as simultaneous localization and mapping.
Previous efforts have primarily focused on pixel-level static-dynamic contexts matching.
This paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency framework.
- Score: 9.964615076037397
- License:
- Abstract: Video semantic segmentation(VSS) has been widely employed in lots of fields, such as simultaneous localization and mapping, autonomous driving and surveillance. Its core challenge is how to leverage temporal information to achieve better segmentation. Previous efforts have primarily focused on pixel-level static-dynamic contexts matching, utilizing techniques such as optical flow and attention mechanisms. Instead, this paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency (SD-CPC) framework. In this framework, we propose multivariate class prototype with contrastive learning and a static-dynamic semantic alignment module. The former provides class-level constraints for the model, obtaining personalized inter-class features and diversified intra-class features. The latter first establishes intra-frame spatial multi-scale and multi-level correlations to achieve static semantic alignment. Then, based on cross-frame static perceptual differences, it performs two-stage cross-frame selective aggregation to achieve dynamic semantic alignment. Meanwhile, we propose a window-based attention map calculation method that leverages the sparsity of attention points during cross-frame aggregation to reduce computation cost. Extensive experiments on VSPW and Cityscapes datasets show that the proposed approach outperforms state-of-the-art methods. Our implementation will be open-sourced on GitHub.
Related papers
- Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective [10.938290904843939]
We propose "Bi-level Optimization of Learning Dynamic with Decoupling and Intervention" (BOLD-DI) to capture both static and dynamic semantics in a decoupled manner.
Our method can be seamlessly integrated into the existing v-CL methods and experimental results highlight the significant improvements.
arXiv Detail & Related papers (2024-07-19T06:53:54Z) - Mamba-FSCIL: Dynamic Adaptation with Selective State Space Model for Few-Shot Class-Incremental Learning [113.89327264634984]
Few-shot class-incremental learning (FSCIL) confronts the challenge of integrating new classes into a model with minimal training samples.
Traditional methods widely adopt static adaptation relying on a fixed parameter space to learn from data that arrive sequentially.
We propose a dual selective SSM projector that dynamically adjusts the projection parameters based on the intermediate features for dynamic adaptation.
arXiv Detail & Related papers (2024-07-08T17:09:39Z) - Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation [126.12940972028012]
We present HVC, a framework for self-supervised video object segmentation.
HVC extracts pseudo-dynamic signals from static images, enabling an efficient and scalable VOS model.
We propose a hybrid visual correspondence loss to learn joint static and dynamic consistency representations.
arXiv Detail & Related papers (2024-04-21T02:21:30Z) - Prompt-Driven Dynamic Object-Centric Learning for Single Domain
Generalization [61.64304227831361]
Single-domain generalization aims to learn a model from single source domain data to achieve generalized performance on other unseen target domains.
We propose a dynamic object-centric perception network based on prompt learning, aiming to adapt to the variations in image complexity.
arXiv Detail & Related papers (2024-02-28T16:16:51Z) - Motion-state Alignment for Video Semantic Segmentation [4.375012768093524]
We propose a novel motion-state alignment framework for video semantic segmentation.
The proposed method picks up dynamic and static semantics in a targeted way.
Experiments on Cityscapes and CamVid datasets show that the proposed approach outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-04-18T08:34:46Z) - Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z) - STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic
Cross-Modal Understanding [68.96574451918458]
We propose a framework named STVG, which models visual-linguistic dependencies with a static branch and a dynamic branch.
Both the static and dynamic branches are designed as cross-modal transformers.
Our proposed method achieved 39.6% vIoU and won the first place in the HC-STVG of the Person in Context Challenge.
arXiv Detail & Related papers (2022-07-06T15:48:58Z) - Multi-modal Visual Place Recognition in Dynamics-Invariant Perception
Space [23.43468556831308]
This letter explores the use of multi-modal fusion of semantic and visual modalities to improve place recognition in dynamic environments.
We achieve this by first designing a novel deep learning architecture to generate the static semantic segmentation.
We then innovatively leverage the spatial-pyramid-matching model to encode the static semantic segmentation into feature vectors.
In parallel, the static image is encoded using the popular Bag-of-words model.
arXiv Detail & Related papers (2021-05-17T13:14:52Z) - ClusterVO: Clustering Moving Instances and Estimating Visual Odometry
for Self and Surroundings [54.33327082243022]
ClusterVO is a stereo Visual Odometry which simultaneously clusters and estimates the motion of both ego and surrounding rigid clusters/objects.
Unlike previous solutions relying on batch input or imposing priors on scene structure or dynamic object models, ClusterVO is online, general and thus can be used in various scenarios including indoor scene understanding and autonomous driving.
arXiv Detail & Related papers (2020-03-29T09:06:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.