STEP: Segmenting and Tracking Every Pixel
- URL: http://arxiv.org/abs/2102.11859v1
- Date: Tue, 23 Feb 2021 18:43:02 GMT
- Title: STEP: Segmenting and Tracking Every Pixel
- Authors: Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender,
Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers,
Aljosa Osep, Laura Leal-Taixe, Liang-Chieh Chen
- Abstract summary: We present a new benchmark: Segmenting and Tracking Every Pixel (STEP).
Our work is the first that targets this task in a real-world setting that requires dense interpretation in both spatial and temporal domains.
To measure performance, we propose a novel evaluation metric, Segmentation and Tracking Quality (STQ).
- Score: 107.23184053133636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle video panoptic segmentation, a task that requires
assigning semantic classes and track identities to all pixels in a video. To
study this important problem in a setting that requires a continuous
interpretation of sensory data, we present a new benchmark: Segmenting and
Tracking Every Pixel (STEP), encompassing two datasets, KITTI-STEP and
MOTChallenge-STEP, together with a new evaluation metric. Our work is the first
that targets this task in a real-world setting that requires dense
interpretation in both spatial and temporal domains. As the ground-truth for
this task is difficult and expensive to obtain, existing datasets are either
constructed synthetically or only sparsely annotated within short video clips.
By contrast, our datasets contain long video sequences, providing challenging
examples and a test-bed for studying long-term pixel-precise segmentation and
tracking. To measure performance, we propose a novel evaluation metric,
Segmentation and Tracking Quality (STQ), that fairly balances semantic and
tracking aspects of this task and is suitable for evaluating sequences of
arbitrary length. We will make our datasets, metric, and baselines publicly
available.
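As context for the metric: in the paper, STQ is the geometric mean of a Segmentation Quality term (SQ, class-level IoU computed over the whole sequence) and an Association Quality term (AQ, track-level IoU weighted by the size of each track overlap), so neither aspect can dominate. The sketch below is a minimal, unofficial rendering of that computation; the function name, array layout, and simplifications (e.g., no crowd or ignore-region handling) are assumptions, not the released evaluation code.

```python
import numpy as np

def stq(pred_sem, gt_sem, pred_ids, gt_ids, num_classes):
    """Simplified STQ over one sequence (unofficial sketch).

    pred_sem / gt_sem: int arrays of semantic class ids, flattened over
    all frames and pixels. pred_ids / gt_ids: int arrays of track ids
    (0 = no instance). All four arrays share one shape.
    """
    # Segmentation Quality (SQ): class-averaged IoU, computed once over
    # the entire sequence rather than per frame.
    ious = []
    for c in range(num_classes):
        p, g = pred_sem == c, gt_sem == c
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(np.logical_and(p, g).sum() / union)
    sq = float(np.mean(ious)) if ious else 0.0

    # Association Quality (AQ): for each ground-truth track, weight the
    # IoU of every overlapping predicted track by the overlap size (the
    # true-positive associations, TPA), normalized by the track size.
    aq_terms = []
    for g_id in np.unique(gt_ids[gt_ids > 0]):
        g_mask = gt_ids == g_id
        total = 0.0
        for p_id in np.unique(pred_ids[g_mask]):
            if p_id == 0:
                continue
            p_mask = pred_ids == p_id
            tpa = np.logical_and(p_mask, g_mask).sum()
            iou = tpa / np.logical_or(p_mask, g_mask).sum()
            total += tpa * iou
        aq_terms.append(total / g_mask.sum())
    aq = float(np.mean(aq_terms)) if aq_terms else 0.0

    # Geometric mean: a low score on either axis pulls STQ down.
    return float(np.sqrt(aq * sq))
```

Because STQ multiplies the two terms under a square root, a method cannot trade tracking quality for segmentation quality: halving either AQ or SQ scales STQ by 1/sqrt(2), regardless of which one dropped.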
Related papers
- ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation [14.534308478766476]
This paper introduces ViCaS, a new dataset containing thousands of challenging videos.
Our benchmark evaluates models on holistic/high-level understanding and language-guided, pixel-precise segmentation.
arXiv Detail & Related papers (2024-12-12T23:10:54Z)
- Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z)
- Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS).
We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions.
We also propose a transformer-based algorithm to improve LTSS, the frequency-based matcher, which solves the over-suppression problem by one-to-many matching (a toy illustration follows this entry).
arXiv Detail & Related papers (2024-06-06T09:57:56Z)
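As rough intuition for the one-to-many matching mentioned in the entry above: standard set-prediction matchers (e.g., Hungarian matching) supervise exactly one query per ground-truth object, which can starve rare classes of positive training signal; a one-to-many matcher instead assigns each ground truth to several queries. The toy sketch below shows only this generic idea; the function name and the fixed top-k rule are assumptions, not the paper's actual frequency-based matcher.

```python
import numpy as np

def one_to_many_match(cost, k=3):
    """Toy one-to-many matcher (illustrative assumption, not the paper's).

    cost: (num_gt, num_queries) matrix of matching costs.
    Returns (gt_index, query_index) pairs: each ground truth is matched
    to its k lowest-cost queries, so it receives several positive
    queries instead of the single one a one-to-one matcher would allow.
    """
    pairs = []
    for g in range(cost.shape[0]):
        for q in np.argsort(cost[g])[:k]:  # the k best queries for this object
            pairs.append((g, int(q)))
    return pairs
```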
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Structured Summarization: Unified Text Segmentation and Segment Labeling as a Generation Task [16.155438404910043]
We propose a single encoder-decoder neural network that can handle long documents and conversations.
We successfully show a way to solve the combined task as a pure generation task.
Our results establish a strong case for considering text segmentation and segment labeling as a whole.
arXiv Detail & Related papers (2022-09-28T01:08:50Z)
- Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on the YouTube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into video object segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles: 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of this task, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)