Learning Local and Global Temporal Contexts for Video Semantic Segmentation
- URL: http://arxiv.org/abs/2204.03330v2
- Date: Tue, 9 Apr 2024 15:44:05 GMT
- Title: Learning Local and Global Temporal Contexts for Video Semantic Segmentation
- Authors: Guolei Sun, Yun Liu, Henghui Ding, Min Wu, Luc Van Gool
- Abstract summary: Contextual information plays a core role in video semantic segmentation (VSS).
This paper categorizes temporal contexts for VSS into two types: local temporal contexts (LTC) and global temporal contexts (GTC).
We propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified representation of LTC.
- Score: 80.01394521812969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual information plays a core role in video semantic segmentation (VSS). This paper categorizes contexts for VSS into two types: local temporal contexts (LTC), which define the contexts from neighboring frames, and global temporal contexts (GTC), which represent the contexts from the whole video. LTC includes static and motional contexts, corresponding to static and moving content in neighboring frames, respectively. Both static and motional contexts have been studied before, but no prior work learns them simultaneously, even though they are highly complementary. Hence, we propose a Coarse-to-Fine Feature Mining (CFFM) technique to learn a unified representation of LTC. CFFM contains two parts: Coarse-to-Fine Feature Assembling (CFFA) and Cross-frame Feature Mining (CFM). CFFA abstracts static and motional contexts, and CFM mines useful information from nearby frames to enhance target features. To exploit further temporal contexts, we propose CFFM++, which additionally learns GTC from the whole video. Specifically, we uniformly sample certain frames from the video and extract global contextual prototypes by k-means; the information within those prototypes is then mined by CFM to refine target features. Experimental results on popular benchmarks demonstrate that CFFM and CFFM++ perform favorably against state-of-the-art methods. Our code is available at https://github.com/GuoleiSun/VSS-CFFM
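To make the GTC idea more concrete, below is a minimal sketch (not the authors' implementation; see the linked repository for that) of how global contextual prototypes could be extracted from uniformly sampled frame features with k-means and then used to refine target-frame features via attention. The function names `kmeans_prototypes` and `mine_prototypes`, the shapes, and all hyperparameters are illustrative assumptions.

```python
# Minimal PyTorch sketch, NOT the official CFFM++ code: it only illustrates
# (1) clustering features of uniformly sampled frames into global contextual
# prototypes with k-means and (2) letting target-frame features attend to them.
import torch
import torch.nn.functional as F


def kmeans_prototypes(feats: torch.Tensor, k: int = 32, iters: int = 10) -> torch.Tensor:
    """feats: (N, C) pixel features pooled from uniformly sampled frames.
    Returns (k, C) cluster centers acting as global contextual prototypes."""
    centers = feats[torch.randperm(feats.size(0))[:k]].clone()
    for _ in range(iters):
        # Assign each feature to its nearest center, then recompute centers.
        assign = torch.cdist(feats, centers).argmin(dim=1)  # (N,)
        for j in range(k):
            members = feats[assign == j]
            if members.numel() > 0:
                centers[j] = members.mean(dim=0)
    return centers


def mine_prototypes(target: torch.Tensor, protos: torch.Tensor) -> torch.Tensor:
    """Refine target-frame features (HW, C) by attending to prototypes (k, C),
    loosely mimicking the cross-frame feature mining step."""
    attn = F.softmax(target @ protos.t() / target.size(-1) ** 0.5, dim=-1)  # (HW, k)
    return target + attn @ protos  # residual refinement


# Toy usage: 6 sampled frames, 16x16 feature maps with 64 channels.
sampled = torch.randn(6, 64, 16, 16)
pooled = sampled.permute(0, 2, 3, 1).reshape(-1, 64)  # (6*256, 64)
protos = kmeans_prototypes(pooled, k=32)

target = torch.randn(16 * 16, 64)       # target-frame features
refined = mine_prototypes(target, protos)
print(refined.shape)                    # torch.Size([256, 64])
```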
Related papers
- AFANet: Adaptive Frequency-Aware Network for Weakly-Supervised Few-Shot Semantic Segmentation [37.9826204492371]
Few-shot learning aims to recognize novel concepts by leveraging prior knowledge learned from a few samples.
We propose an adaptive frequency-aware network (AFANet) for weakly-supervised few-shot semantic segmentation.
arXiv Detail & Related papers (2024-12-23T14:20:07Z)
- Leveraging Temporal Contextualization for Video Action Recognition [47.8361303269338]
We propose a framework for video understanding called Temporally Contextualized CLIP (TC-CLIP).
We introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos.
The Video-Prompting (VP) module processes context tokens to generate informative prompts in the text modality.
arXiv Detail & Related papers (2024-04-15T06:24:56Z)
- C2F-TCN: A Framework for Semi and Fully Supervised Temporal Action Segmentation [20.182928938110923]
Temporal action segmentation assigns an action label to every frame of an untrimmed input video containing a sequence of multiple actions.
We propose an encoder-decoder-style architecture named C2F-TCN featuring a "coarse-to-fine" ensemble of decoder outputs.
We show that the architecture is flexible for both supervised and representation learning.
arXiv Detail & Related papers (2022-12-20T14:53:46Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a multiple instance learning (MIL) based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
- Unsupervised Temporal Video Grounding with Deep Semantic Clustering [58.95918952149763]
Temporal video grounding aims to localize a target segment in a video according to a given sentence query.
In this paper, we explore whether a video grounding model can be learned without any paired annotations.
Since no paired supervision is available, we propose a novel Deep Semantic Clustering Network (DSCNet) that leverages all semantic information from the whole query set.
arXiv Detail & Related papers (2022-01-14T05:16:33Z)
- Flow-Guided Sparse Transformer for Video Deblurring [124.11022871999423]
Flow-Guided Sparse Transformer (FGST) is a framework for video deblurring.
FGSW-MSA uses the estimated optical flow as guidance to globally sample spatially sparse elements corresponding to the same scene patch in neighboring frames.
The proposed FGST outperforms state-of-the-art methods on both the DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring.
arXiv Detail & Related papers (2022-01-06T02:05:32Z)
- Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously using a biaffine mechanism; a minimal illustrative sketch of such biaffine pair scoring follows this list.
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
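As referenced in the Context-aware Biaffine Localizing Network entry above, here is a minimal, self-contained sketch of scoring every (start, end) frame pair at once with a biaffine mechanism. It is not the authors' code; the module name `BiaffinePairScorer`, the hidden sizes, and the masking of invalid pairs (end before start) are assumptions made only for illustration.

```python
# Illustrative biaffine (start, end) pair scoring in PyTorch.
# NOT the official implementation; shapes and names are assumptions.
import torch
import torch.nn as nn


class BiaffinePairScorer(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.start_proj = nn.Linear(dim, hidden)  # project frame features for the "start" role
        self.end_proj = nn.Linear(dim, hidden)    # project frame features for the "end" role
        # Bilinear term plus a linear term over concatenated pair features.
        self.bilinear = nn.Parameter(torch.randn(hidden, hidden) * 0.01)
        self.linear = nn.Linear(2 * hidden, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (T, dim) per-frame features. Returns (T, T) scores where
        score[i, j] rates the segment starting at frame i and ending at frame j."""
        s = torch.relu(self.start_proj(frames))   # (T, H)
        e = torch.relu(self.end_proj(frames))     # (T, H)
        bilinear = s @ self.bilinear @ e.t()      # (T, T) pairwise bilinear scores
        pairs = torch.cat(                        # every (start, end) feature pair, (T, T, 2H)
            [s.unsqueeze(1).expand(-1, e.size(0), -1),
             e.unsqueeze(0).expand(s.size(0), -1, -1)], dim=-1)
        scores = bilinear + self.linear(pairs).squeeze(-1)  # (T, T)
        # Mask invalid pairs where the end index precedes the start index.
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool))
        return scores.masked_fill(~mask, float("-inf"))


# Toy usage: 20 frames with 256-dim features; pick the best (start, end) pair.
scorer = BiaffinePairScorer(dim=256)
scores = scorer(torch.randn(20, 256))
start, end = divmod(int(scores.argmax()), scores.size(1))
print(start, end)  # a segment boundary with start <= end
```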