Related papers: Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation

Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation

URL: http://arxiv.org/abs/2309.08020v1
Date: Thu, 14 Sep 2023 20:31:06 GMT
Title: Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation
Authors: Zhaochong An, Guolei Sun, Zongwei Wu, Hao Tang, Luc Van Gool
Abstract summary: Video semantic segmentation dataset has limited categories per video. Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training. Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
Score: 62.275143240798236
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern approaches have proved the huge potential of addressing semantic segmentation as a mask classification task which is widely used in instance-level segmentation. This paradigm trains models by assigning part of object queries to ground truths via conventional one-to-one matching. However, we observe that the popular video semantic segmentation (VSS) dataset has limited categories per video, meaning less than 10% of queries could be matched to receive meaningful gradient updates during VSS training. This inefficiency limits the full expressive potential of all queries.Thus, we present a novel solution THE-Mask for VSS, which introduces temporal-aware hierarchical object queries for the first time. Specifically, we propose to use a simple two-round matching mechanism to involve more queries matched with minimal cost during training while without any extra cost during inference. To support our more-to-one assignment, in terms of the matching results, we further design a hierarchical loss to train queries with their corresponding hierarchy of primary or secondary. Moreover, to effectively capture temporal information across frames, we propose a temporal aggregation decoder that fits seamlessly into the mask-classification paradigm for VSS. Utilizing temporal-sensitive multi-level queries, our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.

Related papers

Temporal Prompting Matters: Rethinking Referring Video Object Segmentation [64.82333675385802]
Referring Video Object (RVOS) aims to segment the object referred to by the query sentence in the video.<n>Most existing methods require end-to-end training with dense mask annotations.<n>We propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors.
arXiv Detail & Related papers (2025-10-08T17:59:57Z)
CRISP: Contrastive Residual Injection and Semantic Prompting for Continual Video Instance Segmentation [22.488935184542154]
We introduce Contrastive Residual Injection and Semantic Prompting (CRISP)<n>CRISP aims to address the instance-wise, category-wise, and task-wise confusion in continual video instance segmentation.<n>It significantly outperforms existing continual segmentation methods in the long-term continual video instance segmentation task.
arXiv Detail & Related papers (2025-08-14T08:06:01Z)
Rethinking Query-based Transformer for Continual Image Segmentation [59.40646424650094]
Class-incremental/Continual image segmentation (CIS) aims to train an image segmenter in stages, where the set of available categories differs at each stage.<n>Current methods often decouple mask generation from the continual learning process.<n>This study, however, identifies two key issues with decoupled frameworks: loss of plasticity and heavy reliance on input data order.
arXiv Detail & Related papers (2025-07-10T15:03:10Z)
Temporal-consistent CAMs for Weakly Supervised Video Segmentation in Waste Sorting [14.324826668510504]
We present a WS method capable of producing accurate masks for semantic segmentation in the case of video streams. We build saliency maps that exploit the temporal coherence between consecutive frames in a video. We apply our method in a waste-sorting scenario, where we perform weakly supervised video segmentation.
arXiv Detail & Related papers (2025-02-03T15:43:33Z)
Hierarchical Multimodal LLMs with Semantic Space Alignment for Enhanced Time Series Classification [4.5939667818289385]
HiTime is a hierarchical multi-modal model that seamlessly integrates temporal information into large language models. Our findings highlight the potential of integrating temporal features into LLMs, paving the way for advanced time series analysis.
arXiv Detail & Related papers (2024-10-24T12:32:19Z)
Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS) We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions. We also propose a transformer-based algorithm to improve LTSS, frequency-based matcher, which solves the oversuppression problem by one-to-many matching.
arXiv Detail & Related papers (2024-06-06T09:57:56Z)
LaSagnA: Language-based Segmentation Assistant for Complex Queries [39.620806493454616]
Large Language Models for Vision (vLLMs) generate detailed perceptual outcomes, including bounding boxes and masks. In this study, we acknowledge that the main cause of these problems is the insufficient complexity of training queries. We present three novel strategies to effectively handle the challenges arising from the direct integration of the proposed format.
arXiv Detail & Related papers (2024-04-12T14:40:45Z)
Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure. Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction. We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the Youtube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
Temporal Saliency Query Network for Efficient Video Recognition [82.52760040577864]
Video recognition is a hot-spot research topic with the explosive growth of multimedia data on the Internet and mobile devices. Most existing methods select the salient frames without awareness of the class-specific saliency scores. We propose a novel Temporal Saliency Query (TSQ) mechanism, which introduces class-specific information to provide fine-grained cues for saliency measurement.
arXiv Detail & Related papers (2022-07-21T09:23:34Z)
Action Quality Assessment with Temporal Parsing Transformer [84.1272079121699]
Action Quality Assessment (AQA) is important for action understanding and resolving the task poses unique challenges due to subtle visual differences. We propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations. Our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
arXiv Detail & Related papers (2022-07-19T13:29:05Z)
Video Is Graph: Structured Graph Module for Video Action Recognition [34.918667614077805]
We transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames. In particular, SGM divides the neighbors of each node into several temporal regions so as to extract global structural information. The reported performance and analysis demonstrate that SGM can achieve outstanding precision with less computational complexity.
arXiv Detail & Related papers (2021-10-12T11:27:29Z)
Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation. We introduce a novel approach for more accurate and efficient unseen-temporal segmentation. We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
Learnable Dynamic Temporal Pooling for Time Series Classification [22.931314501371805]
We present a dynamic temporal pooling (DTP) technique that reduces the temporal size of hidden representations by aggregating the features at the segment-level. For the partition of a whole series into multiple segments, we utilize dynamic time warping (DTW) to align each time point in a temporal order with the prototypical features of the segments. The DTP layer combined with a fully-connected layer helps to extract further discriminative features considering their temporal position within an input time series.
arXiv Detail & Related papers (2021-04-02T08:58:44Z)
Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in semi-supervised setting. We propose a novel graph neuralS network (TG-Net) which captures the local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.