TubeFormer-DeepLab: Video Mask Transformer
- URL: http://arxiv.org/abs/2205.15361v1
- Date: Mon, 30 May 2022 18:10:33 GMT
- Title: TubeFormer-DeepLab: Video Mask Transformer
- Authors: Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim,
Hartwig Adam, In So Kweon and Liang-Chieh Chen
- Abstract summary: We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner.
TubeFormer-DeepLab directly predicts video tubes with task-specific labels.
- Score: 98.47947102154217
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present TubeFormer-DeepLab, the first attempt to tackle multiple core
video segmentation tasks in a unified manner. Different video segmentation
tasks (e.g., video semantic/instance/panoptic segmentation) are usually
considered as distinct problems. State-of-the-art models adopted in the
separate communities have diverged, and radically different approaches dominate
in each task. By contrast, we make a crucial observation that video
segmentation tasks could be generally formulated as the problem of assigning
different predicted labels to video tubes (where a tube is obtained by linking
segmentation masks along the time axis) and the labels may encode different
values depending on the target task. The observation motivates us to develop
TubeFormer-DeepLab, a simple and effective video mask transformer model that is
widely applicable to multiple video segmentation tasks. TubeFormer-DeepLab
directly predicts video tubes with task-specific labels (either pure semantic
categories, or both semantic categories and instance identities), which not
only significantly simplifies video segmentation models, but also advances
state-of-the-art results on multiple video segmentation benchmarks.
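To make the tube formulation concrete, here is a minimal, hypothetical sketch (an illustration only; TubeFormer-DeepLab predicts tubes directly rather than linking masks post hoc) that builds tubes by greedily matching per-frame binary masks along the time axis with IoU:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two binary masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def link_masks_into_tubes(per_frame_masks, thresh=0.5):
    """per_frame_masks: list (over frames) of lists of binary HxW masks.
    Returns tubes: each tube is a list of (frame_index, mask) pairs.
    Note: post-hoc linking shown here is an assumption for illustration,
    not the paper's method."""
    tubes = [[(0, m)] for m in per_frame_masks[0]]
    for t, masks in enumerate(per_frame_masks[1:], start=1):
        for m in masks:
            # Greedily attach the mask to the tube whose most recent mask
            # overlaps it best; otherwise start a new tube.
            best = max(tubes, key=lambda tube: iou(tube[-1][1], m), default=None)
            if best is not None and iou(best[-1][1], m) >= thresh:
                best.append((t, m))
            else:
                tubes.append([(t, m)])
    return tubes
```

Each tube would then carry a task-specific label: a pure semantic category for video semantic segmentation, or a (category, instance identity) pair for the instance and panoptic settings.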
Related papers
- Multi-Modal Video Topic Segmentation with Dual-Contrastive Domain Adaptation [74.51546366251753]
Video topic segmentation unveils the coarse-grained semantic structure underlying videos.
We introduce a multi-modal video topic segmenter that utilizes both video transcripts and frames.
Our proposed solution significantly surpasses baseline methods in terms of both accuracy and transferability.
arXiv Detail & Related papers (2023-11-30T21:59:05Z)
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop DEVA, a decoupled video segmentation approach composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks; a rough sketch of the pipeline follows this entry.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
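As a rough illustration of that decoupled formulation, the sketch below separates the two stages; segment_image, propagate, and fuse are placeholder callables (assumptions for illustration, not DEVA's actual API):

```python
# Hedged sketch of a decoupled video segmentation pipeline: a task-specific
# image segmenter handles sparse keyframes, and a class/task-agnostic module
# propagates their masks to the remaining frames from both directions.
# segment_image, propagate, and fuse are placeholders, not DEVA's real API.
def decoupled_video_segmentation(frames, segment_image, propagate, fuse, stride=5):
    keyframes = list(range(0, len(frames), stride))
    masks = {k: segment_image(frames[k]) for k in keyframes}   # image-level, task-specific
    for t in range(len(frames)):
        if t in masks:
            continue
        prev_k = max(k for k in keyframes if k < t)            # nearest earlier keyframe
        later = [k for k in keyframes if k > t]
        fwd = propagate(masks[prev_k], frames[prev_k], frames[t])          # forward in time
        if later:
            bwd = propagate(masks[later[0]], frames[later[0]], frames[t])  # backward in time
            masks[t] = fuse(fwd, bwd)   # bi-directional consolidation
        else:
            masks[t] = fwd
    return masks
```

The design point is that only segment_image is task-specific; the propagation stage is class/task-agnostic and reusable across tasks, which is what makes the formulation attractive when task-specific video data is scarce.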
- Multi-Attention Network for Compressed Video Referring Object Segmentation [103.18477550023513]
Referring video object segmentation aims to segment the object referred to by a given language expression.
Existing works typically require the compressed video bitstream to be decoded to RGB frames before segmentation, which may hamper their application in resource-limited real-world scenarios such as autonomous cars and drones.
arXiv Detail & Related papers (2022-07-26T03:00:52Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple, end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach (a sketch of bottom-up grouping follows this entry).
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the lowest run-time among contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
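As a rough illustration of bottom-up, proposal-free instance prediction, the sketch below groups per-pixel embeddings into instances; plain centroid clustering stands in for the paper's tag-based attention (an assumption for illustration):

```python
import numpy as np

# Hedged sketch of bottom-up instance grouping: every pixel carries an
# embedding, and pixels are grouped by proximity to running instance
# centroids, with no region proposals involved. Simple centroid clustering
# here is an illustrative stand-in for the paper's tag-based attention.
def group_pixels(embeddings, thresh=0.5):
    """embeddings: (N, D) array of per-pixel embeddings (N = H*W).
    Returns an integer instance id per pixel."""
    ids = np.full(len(embeddings), -1, dtype=int)
    centroids = []
    for i, e in enumerate(embeddings):
        if centroids:
            dists = np.linalg.norm(np.stack(centroids) - e, axis=1)
            j = int(dists.argmin())
            if dists[j] < thresh:
                ids[i] = j
                centroids[j] = 0.9 * centroids[j] + 0.1 * e   # update running centroid
                continue
        centroids.append(e.astype(float))                     # open a new instance
        ids[i] = len(centroids) - 1
    return ids
```

Because instances emerge from grouping pixels rather than from classifying region proposals, the predicted masks keep pixel-level granularity by construction.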
- Merging Tasks for Video Panoptic Segmentation [0.0]
Video panoptic segmentation (VPS) is a recently introduced computer vision task that requires classifying and tracking every pixel in a given video.
To understand VPS, we first study its earlier-introduced constituent tasks, which address semantics and tracking separately.
We then select two data-driven approaches that do not require training on a tailored dataset to solve it.
arXiv Detail & Related papers (2021-07-10T08:46:42Z)
- End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task of simultaneously classifying, segmenting, and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy (a sketch of the parallel-decoding idea appears after this entry).
arXiv Detail & Related papers (2020-11-30T02:03:50Z)
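To illustrate the parallel sequence-decoding idea, here is a minimal, hypothetical PyTorch sketch; the dimensions and plain decoder are illustrative assumptions, not VisTR's actual architecture (which adds a CNN backbone, positional encodings, and set-based bipartite matching):

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; not VisTR's real configuration.
T, Q, D = 8, 10, 256                     # frames, instance slots per frame, hidden dim
tokens_per_frame = 196                   # e.g. a 14x14 feature map, flattened

frame_feats = torch.randn(T * tokens_per_frame, 1, D)  # clip-level memory (seq, batch, dim)
instance_queries = torch.randn(T * Q, 1, D)            # one query per (frame, slot);
                                                       # learned parameters in a real model

decoder_layer = nn.TransformerDecoderLayer(d_model=D, nhead=8)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

# All T*Q slots are decoded in a single parallel pass over the whole clip;
# slot i in every frame is read out as the same instance, so the sequence of
# slot-i masks forms that instance's track without explicit association.
out = decoder(instance_queries, frame_feats)           # (T*Q, 1, D)
per_frame_slots = out.view(T, Q, D)                    # frame-major instance embeddings
```

Decoding every frame's queries against the same clip-level memory is the design choice that removes the separate tracking step: corresponding slots across frames are an instance's track by construction.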