Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation
- URL: http://arxiv.org/abs/2204.10765v1
- Date: Fri, 22 Apr 2022 15:32:46 GMT
- Title: Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation
- Authors: Jyoti Kini and Mubarak Shah
- Abstract summary: Video Instance Segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposals-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time compared to other contemporary state-of-the-art methods.
- Score: 83.13610762450703
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video Instance Segmentation is a fundamental computer vision task that deals
with segmenting and tracking object instances across a video sequence. Most
existing methods typically accomplish this task by employing a multi-stage
top-down approach that usually involves separate networks to detect and segment
objects in each frame, followed by associating these detections in consecutive
frames using a learned tracking head. In this work, however, we introduce a
simple end-to-end trainable bottom-up approach to achieve instance mask
predictions at the pixel-level granularity, instead of the typical
region-proposals-based approach. Unlike contemporary frame-based models, our
network pipeline processes an input video clip as a single 3D volume to
incorporate temporal information. The central idea of our formulation is to
solve the video instance segmentation task as a tag assignment problem, such
that generating distinct tag values essentially separates individual object
instances across the video sequence (here each tag could be any arbitrary value
between 0 and 1). To this end, we propose a novel spatio-temporal tagging loss
that allows for sufficient separation of different objects as well as necessary
identification of different instances of the same object. Furthermore, we
present a tag-based attention module that improves instance tags, while
concurrently learning instance propagation within a video. Evaluations
demonstrate that our method provides competitive results on YouTube-VIS and
DAVIS-19 datasets, and has the lowest run-time compared to other
state-of-the-art methods.
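The abstract describes the tagging loss only at a high level, and the paper's code is not reproduced here. As a rough illustration, the sketch below shows one way such a spatio-temporal tagging loss could be structured in PyTorch: pixels of one instance are pulled toward that instance's mean tag across the whole clip, while the mean tags of different instances are pushed apart by a margin. The function name, the margin value, and the pull/push formulation are assumptions for illustration, not the authors' exact loss.

```python
import torch

def tagging_loss(tags, instance_masks, margin=0.3):
    """Illustrative spatio-temporal tagging loss (assumed formulation).

    tags:           (T, H, W) tensor, one scalar tag in [0, 1] per pixel of the clip
    instance_masks: list of (T, H, W) boolean masks, one per ground-truth instance
    """
    pull = tags.new_zeros(())
    means = []
    for mask in instance_masks:
        inst_tags = tags[mask]              # tags of this instance, across all frames
        mean = inst_tags.mean()
        means.append(mean)
        # pull term: keep every pixel of an instance near the instance's mean tag,
        # which enforces one consistent identity through time
        pull = pull + ((inst_tags - mean) ** 2).mean()

    push = tags.new_zeros(())
    n = len(means)
    for i in range(n):
        for j in range(i + 1, n):
            # push term: penalize two instances whose mean tags sit closer than the margin
            push = push + torch.relu(margin - (means[i] - means[j]).abs())

    n_pairs = max(n * (n - 1) // 2, 1)
    return pull / max(n, 1) + push / n_pairs
```

Because a single scalar tag in [0, 1] can only hold a limited number of instances a margin apart, the margin is the main knob such a design trades off; per the abstract, the tag-based attention module then further improves these instance tags rather than replacing them.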
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z)
- Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
arXiv Detail & Related papers (2023-04-13T22:20:54Z)
- Human Instance Segmentation and Tracking via Data Association and Single-stage Detector [17.46922710432633]
Human video instance segmentation plays an important role in computer understanding of human activities.
Most current VIS methods are based on the Mask-RCNN framework.
We develop a new method for human video instance segmentation based on a single-stage detector.
arXiv Detail & Related papers (2022-03-31T11:36:09Z)
- Video Instance Segmentation by Instance Flow Assembly [23.001856276175506]
Bottom-up methods dealing with box-free features could offer accurate spatial correlations across frames (see the grouping sketch after this list).
We propose a framework equipped with a temporal context fusion module to better encode inter-frame correlations.
Experiments demonstrate that the proposed method outperforms state-of-the-art online methods (taking image-level input) on the challenging YouTube-VIS dataset.
arXiv Detail & Related papers (2021-10-20T14:49:28Z)
- Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose the Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on the YouTube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z)
- Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS-17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
- STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos [17.232631075144592]
Methods for instance segmentation in videos typically follow the tracking-by-detection paradigm.
We propose a novel approach that segments and tracks instances across space and time in a single stage.
Our method achieves state-of-the-art results across multiple datasets and tasks.
arXiv Detail & Related papers (2020-03-18T18:40:52Z)
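Several entries above (the main paper, Instance Flow Assembly, STEm-Seg) share the bottom-up recipe: predict a per-pixel tag or embedding for the whole clip, then group pixels into instance tubes at inference time. The sketch below shows a minimal greedy grouping step for the scalar-tag case; the `bandwidth` threshold and the greedy seeding are assumptions for illustration, and real systems may use mean-shift or a learned grouping instead.

```python
import torch

def tags_to_instances(tags, fg_mask, bandwidth=0.1):
    """Group foreground pixels of a clip into instances by their scalar tag.

    tags:    (T, H, W) predicted tag per pixel, values in [0, 1]
    fg_mask: (T, H, W) boolean foreground mask
    Returns a (T, H, W) long tensor of instance ids (0 = background).
    """
    ids = torch.zeros_like(tags, dtype=torch.long)
    remaining = fg_mask.clone()
    next_id = 1
    while remaining.any():
        seed = tags[remaining].max()                    # highest ungrouped tag as seed
        member = remaining & ((tags - seed).abs() < bandwidth)
        ids[member] = next_id                           # one id spans all T frames,
        remaining &= ~member                            # so tracking comes for free
        next_id += 1
    return ids
```

Because the same tag value identifies an instance in every frame of the 3D volume, cross-frame association falls out of the grouping itself, which is what lets these bottom-up methods drop the separate learned tracking head used by top-down pipelines.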