Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models
- URL: http://arxiv.org/abs/2310.06992v2
- Date: Thu, 25 Jan 2024 08:11:43 GMT
- Title: Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models
- Authors: Wen-Hsuan Chu, Adam W. Harley, Pavel Tokmakov, Achal Dave, Leonidas
Guibas, Katerina Fragkiadaki
- Abstract summary: Large-scale pre-trained models have shown promising advances in detecting and segmenting objects in 2D static images in the wild.
This raises the question: can we re-purpose these large-scale pre-trained static image models for open-vocabulary video tracking?
In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator into a model that tracks and segments objects of any category in 2D videos.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Object tracking is central to robot perception and scene understanding.
Tracking-by-detection has long been a dominant paradigm for object tracking of
specific object categories. Recently, large-scale pre-trained models have shown
promising advances in detecting and segmenting objects and parts in 2D static
images in the wild. This raises the question: can we re-purpose these
large-scale pre-trained static image models for open-vocabulary video tracking?
In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense
optical flow estimator into a model that tracks and segments objects of any category
in 2D videos. Our method predicts object and part tracks with associated
language descriptions in monocular videos, rebuilding the pipeline of Tracktor
with modern large pre-trained models for static image detection and
segmentation: we detect open-vocabulary object instances and propagate their
boxes from frame to frame using a flow-based motion model, refine the
propagated boxes with the box regression module of the visual detector, and
prompt an open-world segmenter with the refined box to segment the objects. We
decide the termination of an object track based on the objectness score of the
propagated boxes, as well as forward-backward optical flow consistency. We
re-identify objects across occlusions using deep feature matching. We show that
our model achieves strong performance on multiple established video object
segmentation and tracking benchmarks, and can produce reasonable tracks in
manipulation data. In particular, our model outperforms the previous
state of the art on UVO and BURST, benchmarks for open-world object tracking
and segmentation, despite never being explicitly trained for tracking. We hope
that our approach can serve as a simple and extensible framework for future
research.
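A minimal sketch of the frame-to-frame loop the abstract describes is given below. All component interfaces here (detector.detect, detector.regress_box, segmenter.segment, the flow_model callable, and track objects with box/mask/alive fields) are hypothetical stand-ins, not the authors' released implementation; box propagation is simplified to a median-flow shift, and re-identification by deep feature matching is only noted in a comment.
```python
# Hedged sketch of the tracking-by-detection pipeline from the abstract.
# Component names are illustrative assumptions, not the paper's actual API.
import numpy as np

def propagate_box(box, flow):
    """Shift a box (x1, y1, x2, y2) by the median optical flow inside it."""
    x1, y1, x2, y2 = [int(v) for v in box]
    region = flow[y1:y2, x1:x2].reshape(-1, 2)   # flow vectors inside the box
    dx, dy = np.median(region, axis=0)
    return np.array([x1 + dx, y1 + dy, x2 + dx, y2 + dy])

def flow_is_consistent(flow_fw, flow_bw, box, thresh=1.5):
    """Forward-backward check: warping forward and then backward should land
    near the starting point; large residuals signal occlusion."""
    x1, y1, x2, y2 = [int(v) for v in box]
    fw = flow_fw[y1:y2, x1:x2].reshape(-1, 2)
    ys, xs = np.mgrid[y1:y2, x1:x2]
    # Sample the backward flow at the forward-warped pixel locations.
    pts = np.stack([xs.ravel() + fw[:, 0], ys.ravel() + fw[:, 1]], axis=1)
    pts = np.clip(np.round(pts).astype(int), 0, None)
    pts[:, 0] = np.minimum(pts[:, 0], flow_bw.shape[1] - 1)
    pts[:, 1] = np.minimum(pts[:, 1], flow_bw.shape[0] - 1)
    bw = flow_bw[pts[:, 1], pts[:, 0]]
    residual = np.linalg.norm(fw + bw, axis=1)   # ~0 when flow is reliable
    return np.median(residual) < thresh

def track_video(frames, detector, segmenter, flow_model, objectness_thresh=0.5):
    # Open-vocabulary detections on the first frame seed the tracks
    # (boxes, objectness scores, and associated language labels).
    tracks = detector.detect(frames[0])
    for t in range(len(frames) - 1):
        flow_fw = flow_model(frames[t], frames[t + 1])
        flow_bw = flow_model(frames[t + 1], frames[t])
        for trk in tracks:
            if not trk.alive:
                continue
            # 1) Propagate the box with the flow-based motion model.
            box = propagate_box(trk.box, flow_fw)
            # 2) Refine it with the visual detector's box-regression head.
            box, objectness = detector.regress_box(frames[t + 1], box)
            # 3) Terminate on low objectness or forward-backward inconsistency.
            if objectness < objectness_thresh or not flow_is_consistent(
                    flow_fw, flow_bw, trk.box):
                trk.alive = False   # re-ID across occlusions would use
                continue            # deep feature matching (not shown)
            # 4) Prompt the open-world segmenter with the refined box.
            trk.box, trk.mask = box, segmenter.segment(frames[t + 1], box)
    return tracks
```
The forward-backward residual is near zero where the flow is reliable and spikes at occlusions, which is why it can double as a track-termination signal alongside the detector's objectness score.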
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Conditional Object-Centric Learning from Video [34.012087337046005]
We introduce a sequential extension to Slot Attention to predict optical flow for realistic-looking synthetic scenes.
We show that conditioning the initial state of this model on a small set of hints, such as center of mass of objects in the first frame, is sufficient to significantly improve instance segmentation.
These benefits generalize beyond the training distribution to novel objects, novel backgrounds, and to longer video sequences.
arXiv Detail & Related papers (2021-11-24T16:10:46Z)
- The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos [59.12750806239545]
We show that a video contains different views of the same scene related by moving components, and that the right region segmentation and region flow would allow mutual view synthesis.
Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images.
By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically without building them up from low-level edges or optical flows respectively.
arXiv Detail & Related papers (2021-11-11T18:59:11Z)
- Learning to Track with Object Permanence [61.36492084090744]
We introduce an end-to-end trainable approach for joint object detection and tracking.
Our model, trained jointly on synthetic and real data, outperforms the state of the art on the KITTI and MOT17 datasets.
arXiv Detail & Related papers (2021-03-26T04:43:04Z)
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
- AutoTrajectory: Label-free Trajectory Extraction and Prediction from Videos using Dynamic Points [92.91569287889203]
We present a novel, label-free algorithm, AutoTrajectory, for trajectory extraction and prediction.
To better capture the moving objects in videos, we introduce dynamic points.
We aggregate dynamic points to instance points, which stand for moving objects such as pedestrians in videos.
arXiv Detail & Related papers (2020-07-11T08:43:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.