Temporal-aware Hierarchical Mask Classification for Video Semantic
Segmentation
- URL: http://arxiv.org/abs/2309.08020v1
- Date: Thu, 14 Sep 2023 20:31:06 GMT
- Title: Temporal-aware Hierarchical Mask Classification for Video Semantic
Segmentation
- Authors: Zhaochong An, Guolei Sun, Zongwei Wu, Hao Tang, Luc Van Gool
- Abstract summary: The popular video semantic segmentation (VSS) dataset has limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
- Score: 62.275143240798236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern approaches have proved the huge potential of addressing semantic
segmentation as a mask classification task which is widely used in
instance-level segmentation. This paradigm trains models by assigning part of
object queries to ground truths via conventional one-to-one matching. However,
we observe that the popular video semantic segmentation (VSS) dataset has
limited categories per video, meaning less than 10% of queries could be matched
to receive meaningful gradient updates during VSS training. This inefficiency
limits the full expressive potential of all queries. Thus, we present a novel
solution THE-Mask for VSS, which introduces temporal-aware hierarchical object
queries for the first time. Specifically, we propose to use a simple two-round
matching mechanism to involve more queries matched with minimal cost during
training, with no extra cost during inference. To support our
more-to-one assignment, in terms of the matching results, we further design a
hierarchical loss to train queries with their corresponding hierarchy of
primary or secondary. Moreover, to effectively capture temporal information
across frames, we propose a temporal aggregation decoder that fits seamlessly
into the mask-classification paradigm for VSS. Utilizing temporal-sensitive
multi-level queries, our method achieves state-of-the-art performance on the
latest challenging VSS benchmark VSPW without bells and whistles.
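
The two-round matching and hierarchical loss described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not give the matching cost, the loss terms, or the weighting, so the names `min_cost_assignment`, `two_round_matching`, `hierarchical_loss`, and `secondary_weight` are hypothetical, the cost values are made up, and a brute-force search stands in for the Hungarian matching that such pipelines typically use.

```python
from itertools import permutations


def min_cost_assignment(cost, query_ids, num_targets):
    """One-to-one matching by brute force: choose one distinct query per
    target so that the summed cost is minimal. Only suitable for toy sizes."""
    best_total, best_perm = float("inf"), None
    for perm in permutations(query_ids, num_targets):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_total:
            best_total, best_perm = total, perm
    return {t: q for t, q in enumerate(best_perm)}


def two_round_matching(cost):
    """Round one matches primary queries; round two repeats the matching
    over the queries left unmatched, giving secondary queries (assumed scheme)."""
    num_queries, num_targets = len(cost), len(cost[0])
    all_queries = list(range(num_queries))
    primary = min_cost_assignment(cost, all_queries, num_targets)
    remaining = [q for q in all_queries if q not in set(primary.values())]
    secondary = {}
    if len(remaining) >= num_targets:
        secondary = min_cost_assignment(cost, remaining, num_targets)
    return primary, secondary


def hierarchical_loss(pair_loss, primary, secondary, secondary_weight=0.5):
    """Sum the per-pair losses, down-weighting secondary matches.
    The weight 0.5 is an illustrative assumption, not a value from the paper."""
    total = sum(pair_loss[q][t] for t, q in primary.items())
    total += secondary_weight * sum(pair_loss[q][t] for t, q in secondary.items())
    return total


# Toy cost matrix: 5 queries (rows) x 2 ground-truth categories (columns).
cost = [
    [0.90, 0.20],
    [0.10, 0.80],
    [0.30, 0.70],
    [0.60, 0.40],
    [0.95, 0.90],
]

primary, secondary = two_round_matching(cost)
# The matching cost is reused as the per-pair loss purely for demonstration.
loss = hierarchical_loss(cost, primary, secondary)
print("primary:", primary)
print("secondary:", secondary)
print("loss:", round(loss, 4))
```

In this toy case one-to-one matching alone would train 2 of the 5 queries, while the second round raises that to 4 of 5. The matching runs only when computing the training loss, which is consistent with the abstract's claim of no extra inference cost.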
Related papers
- Hierarchical Multimodal LLMs with Semantic Space Alignment for Enhanced Time Series Classification [4.5939667818289385]
HiTime is a hierarchical multi-modal model that seamlessly integrates temporal information into large language models.
Our findings highlight the potential of integrating temporal features into LLMs, paving the way for advanced time series analysis.
arXiv Detail & Related papers (2024-10-24T12:32:19Z)
- Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS).
We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions.
We also propose a transformer-based algorithm to improve LTSS, frequency-based matcher, which solves the oversuppression problem by one-to-many matching.
arXiv Detail & Related papers (2024-06-06T09:57:56Z)
- LaSagnA: Language-based Segmentation Assistant for Complex Queries [39.620806493454616]
Large Language Models for Vision (vLLMs) generate detailed perceptual outcomes, including bounding boxes and masks.
In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries.
We present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format.
arXiv Detail & Related papers (2024-04-12T14:40:45Z)
- Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure.
Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction.
We compare VMT with the most recent state-of-the-art methods on HQ-YTVIS, as well as YouTube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
- Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is a hot-spot research topic with the explosive growth of multimedia data on the Internet and mobile devices.
Most existing methods select the salient frames without awareness of the class-specific saliency scores.
We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
arXiv Detail & Related papers (2022-07-21T09:23:34Z)
- Action Quality Assessment with Temporal Parsing Transformer [84.1272079121699]
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences.
We propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations.
Our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
arXiv Detail & Related papers (2022-07-19T13:29:05Z)
- Video Is Graph: Structured Graph Module for Video Action Recognition [34.918667614077805]
We transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames.
In particular, SGM divides the neighbors of each node into several temporal regions so as to extract global structural information.
The reported performance and analysis demonstrate that SGM can achieve outstanding precision with less computational complexity.
arXiv Detail & Related papers (2021-10-12T11:27:29Z)
- Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient unseen-temporal segmentation.
We evaluate the proposed approach on DAVIS$_{17}$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
- Learnable Dynamic Temporal Pooling for Time Series Classification [22.931314501371805]
We present a dynamic temporal pooling (DTP) technique that reduces the temporal size of hidden representations by aggregating the features at the segment-level.
For the partition of a whole series into multiple segments, we utilize dynamic time warping (DTW) to align each time point in a temporal order with the prototypical features of the segments.
The DTP layer combined with a fully-connected layer helps to extract further discriminative features considering their temporal position within an input time series.
arXiv Detail & Related papers (2021-04-02T08:58:44Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video
Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel graph neural network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)