STEP: Segmenting and Tracking Every Pixel
- URL: http://arxiv.org/abs/2102.11859v1
- Date: Tue, 23 Feb 2021 18:43:02 GMT
- Title: STEP: Segmenting and Tracking Every Pixel
- Authors: Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender,
Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers,
Aljosa Osep, Laura Leal-Taixe, Liang-Chieh Chen
- Abstract summary: We present a new benchmark: Segmenting and Tracking Every Pixel (STEP).
Our work is the first that targets this task in a real-world setting that requires dense interpretation in both spatial and temporal domains.
To measure performance, we propose a novel evaluation metric, Segmentation and Tracking Quality (STQ).
- Score: 107.23184053133636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle video panoptic segmentation, a task that requires
assigning semantic classes and track identities to all pixels in a video. To
study this important problem in a setting that requires a continuous
interpretation of sensory data, we present a new benchmark: Segmenting and
Tracking Every Pixel (STEP), encompassing two datasets, KITTI-STEP and
MOTChallenge-STEP, together with a new evaluation metric. Our work is the first
that targets this task in a real-world setting that requires dense
interpretation in both spatial and temporal domains. As the ground-truth for
this task is difficult and expensive to obtain, existing datasets are either
constructed synthetically or only sparsely annotated within short video clips.
By contrast, our datasets contain long video sequences, providing challenging
examples and a test-bed for studying long-term pixel-precise segmentation and
tracking. To measure performance, we propose a novel evaluation metric,
Segmentation and Tracking Quality (STQ), that fairly balances semantic and
tracking aspects of this task and is suitable for evaluating sequences of
arbitrary length. We will make our datasets, metric, and baselines publicly
available.
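As context for the metric: in the paper, STQ is the geometric mean of a Segmentation Quality term (SQ, class-level IoU computed over the whole sequence) and an Association Quality term (AQ, track-level IoU weighted by the size of each track overlap), so neither aspect can dominate. The sketch below is a minimal, unofficial rendering of that computation; the function name, array layout, and simplifications (e.g., no crowd or ignore-region handling) are assumptions, not the released evaluation code.

```python
import numpy as np

def stq(pred_sem, gt_sem, pred_ids, gt_ids, num_classes):
    """Simplified STQ over one sequence (unofficial sketch).

    pred_sem / gt_sem: int arrays of semantic class ids, flattened over
    all frames and pixels. pred_ids / gt_ids: int arrays of track ids
    (0 = no instance). All four arrays share one shape.
    """
    # Segmentation Quality (SQ): class-averaged IoU, computed once over
    # the entire sequence rather than per frame.
    ious = []
    for c in range(num_classes):
        p, g = pred_sem == c, gt_sem == c
        union = np.logical_or(p, g).sum()
        if union > 0:
            ious.append(np.logical_and(p, g).sum() / union)
    sq = float(np.mean(ious)) if ious else 0.0

    # Association Quality (AQ): for each ground-truth track, weight the
    # IoU of every overlapping predicted track by the overlap size (the
    # true-positive associations, TPA), normalized by the track size.
    aq_terms = []
    for g_id in np.unique(gt_ids[gt_ids > 0]):
        g_mask = gt_ids == g_id
        total = 0.0
        for p_id in np.unique(pred_ids[g_mask]):
            if p_id == 0:
                continue
            p_mask = pred_ids == p_id
            tpa = np.logical_and(p_mask, g_mask).sum()
            iou = tpa / np.logical_or(p_mask, g_mask).sum()
            total += tpa * iou
        aq_terms.append(total / g_mask.sum())
    aq = float(np.mean(aq_terms)) if aq_terms else 0.0

    # Geometric mean: a low score on either axis pulls STQ down.
    return float(np.sqrt(aq * sq))
```

Because STQ multiplies the two terms under a square root, a method cannot trade tracking quality for segmentation quality: halving either AQ or SQ scales STQ by 1/sqrt(2), regardless of which one dropped.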
Related papers
- ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation [14.534308478766476]
This paper introduces ViCaS, a new dataset containing thousands of challenging videos.
Our benchmark evaluates models on holistic/high-level understanding and language-guided, pixel-precise segmentation.
arXiv Detail & Related papers (2024-12-12T23:10:54Z)
- Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset.
We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation.
In addition, we present a memory-based mask propagation model (MMPM), trained and evaluated on the MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z)
- Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS).
We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions.
We also propose a transformer-based algorithm to improve LTSS, the frequency-based matcher, which solves the over-suppression problem by one-to-many matching (a toy illustration follows this entry).
arXiv Detail & Related papers (2024-06-06T09:57:56Z)
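As rough intuition for the one-to-many matching mentioned in the entry above: standard set-prediction matchers (e.g., Hungarian matching) supervise exactly one query per ground-truth object, which can starve rare classes of positive training signal; a one-to-many matcher instead assigns each ground truth to several queries. The toy sketch below shows only this generic idea; the function name and the fixed top-k rule are assumptions, not the paper's actual frequency-based matcher.

```python
import numpy as np

def one_to_many_match(cost, k=3):
    """Toy one-to-many matcher (illustrative assumption, not the paper's).

    cost: (num_gt, num_queries) matrix of matching costs.
    Returns (gt_index, query_index) pairs: each ground truth is matched
    to its k lowest-cost queries, so it receives several positive
    queries instead of the single one a one-to-one matcher would allow.
    """
    pairs = []
    for g in range(cost.shape[0]):
        for q in np.argsort(cost[g])[:k]:  # the k best queries for this object
            pairs.append((g, int(q)))
    return pairs
```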
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Structured Summarization: Unified Text Segmentation and Segment Labeling as a Generation Task [16.155438404910043]
We propose a single encoder-decoder neural network that can handle long documents and conversations.
We successfully show a way to solve the combined task as a pure generation task.
Our results establish a strong case for considering text segmentation and segment labeling as a whole.
arXiv Detail & Related papers (2022-09-28T01:08:50Z)
- Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on the YouTube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into video object segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles: 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of this task, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)