Motion-state Alignment for Video Semantic Segmentation
- URL: http://arxiv.org/abs/2304.08820v1
- Date: Tue, 18 Apr 2023 08:34:46 GMT
- Title: Motion-state Alignment for Video Semantic Segmentation
- Authors: Jinming Su, Ruihong Yin, Shuaibin Zhang and Junfeng Luo
- Abstract summary: We propose a novel motion-state alignment framework for video semantic segmentation.
The proposed method extracts dynamic and static semantics in a targeted way.
Experiments on Cityscapes and CamVid datasets show that the proposed approach outperforms state-of-the-art methods.
- Score: 4.375012768093524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, video semantic segmentation has made great progress with
advanced deep neural networks. However, there still exist two main challenges,
i.e., information inconsistency and computation cost. To deal with the two
difficulties, we propose a novel motion-state alignment framework for video
semantic segmentation to keep both motion and state consistency. In the
framework, we first construct a motion alignment branch armed with an efficient
decoupled transformer to capture dynamic semantics, guaranteeing region-level
temporal consistency. Then, a state alignment branch composed of a stage
transformer is designed to enrich feature spaces for the current frame to
extract static semantics and achieve pixel-level state consistency. Next, by a
semantic assignment mechanism, the region descriptor of each semantic category
is gained from dynamic semantics and linked with pixel descriptors from static
semantics. Benefiting from the alignment of these two kinds of effective
information, the proposed method extracts dynamic and static semantics in a
targeted way, so that video semantic regions are segmented consistently and
localized precisely with low computational complexity. Extensive
experiments on Cityscapes and CamVid datasets show that the proposed approach
outperforms state-of-the-art methods and validates the effectiveness of the
motion-state alignment framework.
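As a concrete illustration of the semantic assignment mechanism described in the abstract, below is a minimal PyTorch-style sketch (not the authors' implementation): per-class region descriptors from the motion (dynamic) branch are linked to pixel descriptors from the state (static) branch by descriptor similarity. The tensor shapes, cosine scoring, and the 19-class Cityscapes setting are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def semantic_assignment(pixel_desc, region_desc):
    """Link per-pixel (static-branch) descriptors with per-class region
    (dynamic-branch) descriptors via similarity. A rough sketch of the
    paper's assignment step; shapes and scoring are assumptions.

    pixel_desc:  (B, C, H, W)  static semantics, one descriptor per pixel
    region_desc: (B, K, C)     dynamic semantics, one descriptor per class
    returns:     (B, K, H, W)  per-pixel class scores
    """
    B, C, H, W = pixel_desc.shape
    pixels = pixel_desc.flatten(2).transpose(1, 2)      # (B, HW, C)
    pixels = F.normalize(pixels, dim=-1)                # cosine similarity
    regions = F.normalize(region_desc, dim=-1)          # (B, K, C)
    logits = pixels @ regions.transpose(1, 2)           # (B, HW, K)
    return logits.transpose(1, 2).reshape(B, -1, H, W)  # (B, K, H, W)

# Toy usage: 19 Cityscapes classes, 1/8-resolution feature map
pixel_desc = torch.randn(2, 256, 64, 128)
region_desc = torch.randn(2, 19, 256)
seg_logits = semantic_assignment(pixel_desc, region_desc)
pred = seg_logits.argmax(dim=1)  # (2, 64, 128) per-pixel class map
```
Since the region descriptors carry region-level temporal consistency across frames, linking pixels to them in this way lets the per-frame static features inherit that consistency.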
Related papers
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description.
We propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation.
We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z)
- Static-Dynamic Class-level Perception Consistency in Video Semantic Segmentation [9.964615076037397]
Video semantic segmentation (VSS) has been widely employed in many fields, such as simultaneous localization and mapping.
Previous efforts have primarily focused on pixel-level static-dynamic contexts matching.
This paper rethinks static-dynamic contexts at the class level and proposes a novel static-dynamic class-level perceptual consistency framework.
arXiv Detail & Related papers (2024-12-11T02:29:51Z)
- Self-Supervised Video Representation Learning in a Heuristic Decoupled Perspective [10.938290904843939]
We propose "Bi-level Optimization of Learning Dynamic with Decoupling and Intervention" (BOLD-DI) to capture both static and dynamic semantics in a decoupled manner.
Our method can be seamlessly integrated into existing v-CL methods, and experimental results highlight significant improvements.
arXiv Detail & Related papers (2024-07-19T06:53:54Z)
- Context Propagation from Proposals for Semantic Video Object Segmentation [1.223779595809275]
We propose a novel approach to learning semantic contextual relationships in videos for semantic object segmentation.
Our approach derives semantic contexts from video object proposals, which encode the key evolution of objects and the relationships among objects over the semantic-temporal domain.
arXiv Detail & Related papers (2024-07-08T14:44:18Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- Alignment-free HDR Deghosting with Semantics Consistent Transformer [76.91669741684173]
High dynamic range imaging aims to retrieve information from multiple low dynamic range inputs to generate realistic output.
Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion.
We propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules.
arXiv Detail & Related papers (2023-05-29T15:03:23Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatial-temporal kernels of dynamic scale to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Dynamic Dual Sampling Module for Fine-Grained Semantic Segmentation [27.624291416260185]
We propose a Dynamic Dual Sampling Module (DDSM) to conduct dynamic affinity modeling and propagate semantic context to local details.
Experimental results on both the Cityscapes and CamVid datasets validate the effectiveness and efficiency of the proposed approach.
arXiv Detail & Related papers (2021-05-25T04:25:47Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.