Video Semantic Segmentation with Inter-Frame Feature Fusion and
Inner-Frame Feature Refinement
- URL: http://arxiv.org/abs/2301.03832v1
- Date: Tue, 10 Jan 2023 07:57:05 GMT
- Title: Video Semantic Segmentation with Inter-Frame Feature Fusion and
Inner-Frame Feature Refinement
- Authors: Jiafan Zhuang, Zilei Wang, Junjie Li
- Abstract summary: We propose a spatial-temporal fusion (STF) module to model dense pairwise relationships among multi-frame features.
Besides, we propose a novel memory-augmented refinement (MAR) module to tackle difficult predictions among semantic boundaries.
- Score: 39.06589186472675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video semantic segmentation aims to generate an accurate semantic
map for each video frame. To this end, many works integrate diverse
information from consecutive frames to enhance the features used for
prediction, which usually requires a feature alignment procedure based on
estimated optical flow. However, optical flow estimation is inevitably
inaccurate, which introduces noise into the feature fusion and ultimately
degrades the segmentation results. In this paper, to tackle the misalignment
issue, we propose a spatial-temporal fusion (STF) module that models dense
pairwise relationships among multi-frame features. Different from previous
methods, STF uniformly and adaptively fuses features at different spatial and
temporal positions, and avoids error-prone optical flow estimation. In
addition, we exploit feature refinement within a single frame and propose a
novel memory-augmented refinement (MAR) module to tackle difficult predictions
near semantic boundaries. Specifically, MAR stores boundary features and
prototypes extracted from the training samples, which together form a
task-specific memory, and uses them to refine the features during inference.
Essentially, MAR moves hard features closer to their most likely category and
thus makes them more discriminative. We conduct extensive experiments on
Cityscapes and CamVid, and the results show that our proposed methods
significantly outperform previous approaches and achieve state-of-the-art
performance. Code and pretrained models are available at
https://github.com/jfzhuang/ST_Memory.
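Based only on the abstract above, the two modules can be sketched as follows. The function names, tensor shapes, and numpy implementation are illustrative assumptions, not the authors' actual code (which is in the linked repository): STF is rendered as plain dot-product attention from every position of the current frame to every position of every frame, and MAR as a nearest-prototype pull on hard features.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stf_fuse(frames):
    """STF sketch: each spatial position of the current frame attends to
    every spatial position of every frame (dense pairwise relations),
    so no optical-flow-based alignment is required.

    frames: (T, H, W, C) features; frames[-1] is the current frame.
    Returns fused features of shape (H, W, C).
    """
    T, H, W, C = frames.shape
    q = frames[-1].reshape(H * W, C)       # queries: current frame
    kv = frames.reshape(T * H * W, C)      # keys/values: all frames
    attn = softmax(q @ kv.T / np.sqrt(C))  # (H*W, T*H*W) pairwise weights
    return (attn @ kv).reshape(H, W, C)    # adaptively weighted fusion

def mar_refine(features, prototypes, alpha=0.5):
    """MAR sketch: pull each hard (boundary) feature toward the most
    similar class prototype stored in a task-specific memory.

    features: (N, C) hard features; prototypes: (K, C) class memory.
    """
    # cosine similarity between features and prototypes
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    nearest = (f @ p.T).argmax(axis=1)     # most likely category
    # move each feature part of the way toward its nearest prototype
    return features + alpha * (prototypes[nearest] - features)
```

With `alpha = 0.5`, a hard feature ends up halfway between its original position and the nearest prototype, which is one simple way to make it "closer to the most likely category" as the abstract describes.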
Related papers
- MVP-Shot: Multi-Velocity Progressive-Alignment Framework for Few-Shot Action Recognition [36.426688592783975]
MVP-Shot is a framework to learn and align semantic-related action features at multi-velocity levels.
MVFA module measures similarity between features from support and query videos with different velocity scales.
PST module injects velocity-tailored text information into the video feature via feature interaction on channel and temporal domains.
arXiv Detail & Related papers (2024-05-03T13:10:16Z)
- ColorMNet: A Memory-based Deep Spatial-Temporal Feature Propagation Network for Video Colorization [62.751303924391564]
How to effectively explore spatial-temporal features is important for video colorization.
We develop a memory-based feature propagation module that can establish reliable connections with features from far-apart frames.
We develop a local attention module to aggregate the features from adjacent frames in a spatial-temporal neighborhood.
arXiv Detail & Related papers (2024-04-09T12:23:30Z)
- Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z)
- Semantic Diffusion Network for Semantic Segmentation [1.933681537640272]
We introduce an operator-level approach to enhance semantic boundary awareness.
We propose a novel learnable approach called the semantic diffusion network (SDN).
Our SDN aims to construct a differentiable mapping from the original feature to the inter-class boundary-enhanced feature.
arXiv Detail & Related papers (2023-02-04T01:39:16Z)
- Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation [87.4854250338374]
We explore relations among affinities in two aspects: single-scale intrinsic correlations and multi-scale relations.
Our experiments demonstrate that the proposed method performs favorably against state-of-the-art VSS methods.
arXiv Detail & Related papers (2022-07-21T12:12:36Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation [31.100954335785026]
We advocate the return of motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised video object segmentation.
We introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature.
We achieve 76.5% J&F using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol.
arXiv Detail & Related papers (2021-11-29T16:01:28Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform semantic video segmentation efficiently in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)
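The knowledge-distillation idea in the last entry above can be illustrated with a generic pixel-wise distillation loss. This is a textbook formulation rather than that paper's specific design; `distill_loss` and the `temperature` value are illustrative names chosen here.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=4.0):
    """Pixel-wise KL divergence between the softened class distributions
    of a large teacher and a compact student, averaged over all pixels.

    student_logits, teacher_logits: (H, W, K) per-pixel class scores.
    The temperature**2 factor keeps gradient magnitudes comparable
    across temperatures (standard Hinton-style scaling).
    """
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = (t * (np.log(t + 1e-8) - np.log(s + 1e-8))).sum(axis=-1)
    return float(kl.mean()) * temperature ** 2
```

Minimizing this term alongside the usual cross-entropy on ground-truth labels is the standard way a compact per-frame model is pushed toward a large model's predictions.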
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.