TarViS: A Unified Approach for Target-based Video Segmentation
- URL: http://arxiv.org/abs/2301.02657v2
- Date: Wed, 10 May 2023 16:40:04 GMT
- Title: TarViS: A Unified Approach for Target-based Video Segmentation
- Authors: Ali Athar, Alexander Hermans, Jonathon Luiten, Deva Ramanan, Bastian
Leibe
- Abstract summary: TarViS is a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video.
Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks.
To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS), and Point Exemplar-guided Tracking (PET).
- Score: 115.5770357189209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The general domain of video segmentation is currently fragmented into
different tasks spanning multiple benchmarks. Despite rapid progress in the
state-of-the-art, current methods are overwhelmingly task-specific and cannot
conceptually generalize to other tasks. Inspired by recent approaches with
multi-task capability, we propose TarViS: a novel, unified network architecture
that can be applied to any task that requires segmenting a set of arbitrarily
defined 'targets' in video. Our approach is flexible with respect to how tasks
define these targets, since it models the latter as abstract 'queries' which
are then used to predict pixel-precise target masks. A single TarViS model can
be trained jointly on a collection of datasets spanning different tasks, and
can hot-swap between tasks during inference without any task-specific
retraining. To demonstrate its effectiveness, we apply TarViS to four different
tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation
(VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking
(PET). Our unified, jointly trained model achieves state-of-the-art performance
on 5/7 benchmarks spanning these four tasks, and competitive performance on the
remaining two. Code and model weights are available at:
https://github.com/Ali2500/TarViS
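The query-based design described in the abstract can be pictured with a small sketch. The snippet below is an illustrative approximation, not the authors' implementation: it assumes a generic transformer decoder that refines a set of abstract target queries against spatio-temporal backbone features and predicts mask logits via a dot product between query embeddings and per-pixel features. Every module name, dimension, and query count here is a placeholder.

import torch
import torch.nn as nn


class QueryBasedVideoSegmenter(nn.Module):
    """Toy query-based video segmenter (illustrative, not the official TarViS code)."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerDecoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        # Target queries cross-attend to flattened spatio-temporal video features.
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.mask_embed = nn.Linear(embed_dim, embed_dim)

    def forward(self, target_queries, video_features):
        # target_queries: (B, Q, C) abstract target encodings supplied by the task
        #                 (e.g. instance/semantic queries for VIS/VPS, or
        #                  first-frame object encodings for VOS/PET).
        # video_features: (B, T, C, H, W) backbone features for a clip of T frames.
        B, T, C, H, W = video_features.shape
        pixels = video_features.permute(0, 1, 3, 4, 2)             # (B, T, H, W, C)
        memory = pixels.reshape(B, T * H * W, C)                   # flatten space-time
        refined = self.decoder(tgt=target_queries, memory=memory)  # (B, Q, C)
        # Mask logits: dot product of query embeddings with per-pixel features.
        return torch.einsum("bqc,bthwc->bqthw", self.mask_embed(refined), pixels)


# The same weights can serve different tasks; only the query set changes.
model = QueryBasedVideoSegmenter()
feats = torch.randn(1, 4, 256, 32, 32)      # 4 frames of (hypothetical) backbone features
vis_queries = torch.randn(1, 100, 256)      # e.g. 100 instance queries for VIS
vos_queries = torch.randn(1, 3, 256)        # e.g. 3 first-frame object encodings for VOS
print(model(vis_queries, feats).shape)      # torch.Size([1, 100, 4, 32, 32])
print(model(vos_queries, feats).shape)      # torch.Size([1, 3, 4, 32, 32])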
Related papers
- Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks [26.007846170517055]
We propose a single unified framework, coined as Temporal2Seq, to formulate the output of temporal video understanding tasks as a sequence of discrete tokens.
With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks.
We evaluate our Temporal2Seq generalist model on the corresponding test sets of three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks.
arXiv Detail & Related papers (2024-09-27T06:37:47Z)
- OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video [58.71785546245467]
Multiple existing benchmarks involve tracking and segmenting objects in video.
There is little interaction between them due to the use of disparate benchmark datasets and metrics.
We propose BURST, a dataset which contains thousands of diverse videos with high-quality object masks.
All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison.
arXiv Detail & Related papers (2022-09-25T01:27:35Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Merging Tasks for Video Panoptic Segmentation [0.0]
Video panoptic segmentation (VPS) is a recently introduced computer vision task that requires classifying and tracking every pixel in a given video.
To understand video panoptic segmentation, the earlier-introduced constituent tasks that focus on semantics and tracking separately are studied first.
Two data-driven approaches that do not require training on a tailored dataset are then selected to solve the combined task.
arXiv Detail & Related papers (2021-07-10T08:46:42Z)
- Conditional Channel Gated Networks for Task-Aware Continual Learning [44.894710899300435]
Convolutional Neural Networks experience catastrophic forgetting when optimized on a sequence of learning problems.
We introduce a novel framework to tackle this problem with conditional computation.
We validate our proposal on four continual learning datasets.
arXiv Detail & Related papers (2020-03-31T19:35:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.