STEP: Segmenting and Tracking Every Pixel
- URL: http://arxiv.org/abs/2102.11859v1
- Date: Tue, 23 Feb 2021 18:43:02 GMT
- Title: STEP: Segmenting and Tracking Every Pixel
- Authors: Mark Weber, Jun Xie, Maxwell Collins, Yukun Zhu, Paul Voigtlaender,
Hartwig Adam, Bradley Green, Andreas Geiger, Bastian Leibe, Daniel Cremers,
Aljosa Osep, Laura Leal-Taixe, Liang-Chieh Chen
- Abstract summary: We present a new benchmark: Segmenting and Tracking Every Pixel (STEP).
Our work is the first that targets this task in a real-world setting that requires dense interpretation in both spatial and temporal domains.
For measuring the performance, we propose a novel evaluation metric, Segmentation and Tracking Quality (STQ).
- Score: 107.23184053133636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we tackle video panoptic segmentation, a task that requires
assigning semantic classes and track identities to all pixels in a video. To
study this important problem in a setting that requires a continuous
interpretation of sensory data, we present a new benchmark: Segmenting and
Tracking Every Pixel (STEP), encompassing two datasets, KITTI-STEP and
MOTChallenge-STEP, together with a new evaluation metric. Our work is the first
that targets this task in a real-world setting that requires dense
interpretation in both spatial and temporal domains. As the ground-truth for
this task is difficult and expensive to obtain, existing datasets are either
constructed synthetically or only sparsely annotated within short video clips.
By contrast, our datasets contain long video sequences, providing challenging
examples and a test-bed for studying long-term pixel-precise segmentation and
tracking. For measuring the performance, we propose a novel evaluation metric
Segmentation and Tracking Quality (STQ) that fairly balances semantic and
tracking aspects of this task and is suitable for evaluating sequences of
arbitrary length. We will make our datasets, metric, and baselines publicly
available.
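As a rough illustration of how the proposed metric is structured, the sketch below computes STQ as the geometric mean of a class-level Segmentation Quality term (SQ, semantic IoU over all frames, identities ignored) and a track-level Association Quality term (AQ, IoU between predicted and ground-truth tracks weighted by their true-positive overlap). Variable names, array shapes, the thing/stuff handling, and the omission of ignore regions are simplifying assumptions for illustration; the authors' released metric code should be treated as the reference implementation.

```python
import numpy as np

def stq(pred_sem, pred_ids, gt_sem, gt_ids, num_classes, thing_classes):
    """Minimal sketch of Segmentation and Tracking Quality (STQ).

    pred_sem, gt_sem: int arrays [T, H, W] with semantic class ids.
    pred_ids, gt_ids: int arrays [T, H, W] with track ids (0 = no instance).
    Simplifying assumption: no ignore/void regions.
    """
    # Segmentation Quality (SQ): mean class-level IoU over all frames,
    # with track identities ignored.
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred_sem == c, gt_sem == c).sum()
        union = np.logical_or(pred_sem == c, gt_sem == c).sum()
        if union > 0:
            ious.append(inter / union)
    sq = float(np.mean(ious)) if ious else 0.0

    # Association Quality (AQ): for each ground-truth track, accumulate the IoU
    # with every overlapping predicted track, weighted by the size of their
    # overlap (true-positive associations), then average over ground-truth tracks.
    def extract_tracks(sem, ids):
        thing = np.isin(sem, list(thing_classes))
        return {int(i): np.logical_and(thing, ids == i)
                for i in np.unique(ids[thing]) if i != 0}

    gt_tracks = extract_tracks(gt_sem, gt_ids)
    pred_tracks = extract_tracks(pred_sem, pred_ids)
    aq_terms = []
    for g_mask in gt_tracks.values():
        acc = 0.0
        for p_mask in pred_tracks.values():
            tpa = np.logical_and(g_mask, p_mask).sum()  # true-positive associations
            if tpa == 0:
                continue
            acc += tpa * (tpa / np.logical_or(g_mask, p_mask).sum())
        aq_terms.append(acc / g_mask.sum())
    aq = float(np.mean(aq_terms)) if aq_terms else 0.0

    # STQ balances both aspects via the geometric mean.
    return float(np.sqrt(sq * aq)), sq, aq
```

Because STQ is a geometric mean, a method cannot score well by optimizing only the semantic term or only the association term, which is how the metric balances the two aspects of the task.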
Related papers
- Frequency-based Matcher for Long-tailed Semantic Segmentation [22.199174076366003]
We focus on a relatively under-explored task setting, long-tailed semantic segmentation (LTSS).
We propose a dual-metric evaluation system and construct the LTSS benchmark to demonstrate the performance of semantic segmentation methods and long-tailed solutions.
We also propose a transformer-based algorithm to improve LTSS, frequency-based matcher, which solves the oversuppression problem by one-to-many matching.
arXiv Detail & Related papers (2024-06-06T09:57:56Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Structured Summarization: Unified Text Segmentation and Segment Labeling as a Generation Task [16.155438404910043]
We propose a single encoder-decoder neural network that can handle long documents and conversations.
We successfully show a way to solve the combined task as a pure generation task.
Our results establish a strong case for considering text segmentation and segment labeling as a whole.
arXiv Detail & Related papers (2022-09-28T01:08:50Z)
- A Pixel-Level Meta-Learner for Weakly Supervised Few-Shot Semantic Segmentation [40.27705176115985]
Few-shot semantic segmentation addresses the learning task in which only a few images with ground truth pixel-level labels are available for the novel classes of interest.
We propose a novel meta-learning framework, which predicts pseudo pixel-level segmentation masks from a limited amount of data and their semantic labels.
Our proposed learning model can be viewed as a pixel-level meta-learner.
arXiv Detail & Related papers (2021-11-02T08:28:11Z)
- Quantifying the Task-Specific Information in Text-Based Classifications [20.148222318025528]
Shortcuts in datasets do not contribute to the *task-specific information* (TSI) of the classification tasks.
In this paper, we consider how much task-specific information is required to classify a dataset.
This framework allows us to compare across datasets, saying that, apart from a set of "shortcut features", classifying each sample in the Multi-NLI task involves around 0.4 nats more TSI than in the Quora Question Pairs.
arXiv Detail & Related papers (2021-10-17T21:54:38Z)
- Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on Youtube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of this task, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)