Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual
Tracking and Segmentation
- URL: http://arxiv.org/abs/2308.13266v3
- Date: Thu, 21 Sep 2023 06:21:48 GMT
- Title: Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual
Tracking and Segmentation
- Authors: Yuanyou Xu, Zongxin Yang, Yi Yang
- Abstract summary: This paper proposes a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation (MITS).
A novel pinpoint box predictor is proposed for accurate multi-object box prediction.
MITS achieves state-of-the-art performance on both Visual Object Tracking (VOT) and Video Object Segmentation (VOS) benchmarks.
- Score: 37.85026590250023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tracking any given object(s) spatially and temporally is a common purpose in
Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Joint
tracking and segmentation have been attempted in some studies but they often
lack full compatibility of both box and mask in initialization and prediction,
and mainly focus on single-object scenarios. To address these limitations, this
paper proposes a Multi-object Mask-box Integrated framework for unified
Tracking and Segmentation, dubbed MITS. Firstly, the unified identification
module is proposed to support both box and mask reference for initialization,
where detailed object information is inferred from boxes or directly retained
from masks. Additionally, a novel pinpoint box predictor is proposed for
accurate multi-object box prediction, facilitating target-oriented
representation learning. All target objects are processed simultaneously from
encoding to propagation and decoding, as a unified pipeline for VOT and VOS.
Experimental results show MITS achieves state-of-the-art performance on both
VOT and VOS benchmarks. Notably, MITS surpasses the best prior VOT competitor
by around 6% on the GOT-10k test set, and significantly improves the
performance of box initialization on VOS benchmarks. The code is available at
https://github.com/yoxu515/MITS.
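The box/mask compatibility that MITS targets can be illustrated with a minimal, hypothetical sketch (not the paper's actual identification module): a mask reference can be reduced to its tight bounding box, and a box reference can be lifted to a coarse rectangular mask, so either form can initialize a unified tracker. A real system would infer the object's true shape from image features rather than fill the rectangle.

```python
import numpy as np

def mask_to_box(mask: np.ndarray) -> tuple:
    """Tight bounding box (x0, y0, x1, y1) around a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def box_to_mask(box: tuple, shape: tuple) -> np.ndarray:
    """Coarse rectangular mask from a box; a placeholder for the
    detailed object information MITS infers from box references."""
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=np.uint8)
    mask[y0:y1 + 1, x0:x1 + 1] = 1
    return mask

# Round trip: a mask reference and a box reference describe the same target.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:5, 3:7] = 1                    # 3-row, 4-column rectangular object
box = mask_to_box(mask)               # (3, 2, 6, 4)
coarse = box_to_mask(box, mask.shape)
```

For a rectangular object the round trip is exact; for arbitrary shapes the coarse mask over-covers the object, which is precisely the gap a learned identification module would close.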
Related papers
- Beyond SOT: Tracking Multiple Generic Objects at Once [141.36900362724975]
Generic Object Tracking (GOT) is the problem of tracking target objects, specified by bounding boxes in the first frame of a video.
We introduce a new large-scale GOT benchmark, LaGOT, containing multiple annotated target objects per sequence.
Our approach achieves highly competitive results on single-object GOT datasets, setting a new state of the art on TrackingNet with a success rate AUC of 84.4%.
arXiv Detail & Related papers (2022-12-22T17:59:19Z)
- BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z)
- Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object Segmentation (VOS).
We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST).
Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z)
- Robust Visual Tracking by Segmentation [103.87369380021441]
Estimating the target extent poses a fundamental challenge in visual object tracking.
We propose a segmentation-centric tracking pipeline that produces a highly accurate segmentation mask.
Our tracker is able to better learn a target representation that clearly differentiates the target in the scene from background content.
arXiv Detail & Related papers (2022-03-21T17:59:19Z)
- Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation [95.74244714914052]
Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes.
We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information online.
PCAN outperforms current video instance tracking and segmentation competition winners on Youtube-VIS and BDD100K datasets.
arXiv Detail & Related papers (2021-06-22T17:57:24Z)
- Single Object Tracking through a Fast and Effective Single-Multiple Model Convolutional Neural Network [0.0]
Recent state-of-the-art (SOTA) approaches rely on a matching network with a heavy structure to distinguish the target from other objects in the area.
In this article, a special architecture is proposed that, in contrast to previous approaches, can identify the object location in a single shot.
The presented tracker performs comparably with the SOTA in challenging situations while running far faster (up to 120 FPS on a 1080 Ti).
arXiv Detail & Related papers (2021-03-28T11:02:14Z)
- Make One-Shot Video Object Segmentation Efficient Again [7.7415390727490445]
Video object segmentation (VOS) describes the task of segmenting a set of objects in each frame of a video.
e-OSVOS decouples the object detection task and predicts only local segmentation masks by applying a modified version of Mask R-CNN.
e-OSVOS provides state-of-the-art results on DAVIS 2016, DAVIS 2017, and YouTube-VOS for one-shot fine-tuning methods.
arXiv Detail & Related papers (2020-12-03T12:21:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.