Related papers: Tracking Anything with Decoupled Video Segmentation

Tracking Anything with Decoupled Video Segmentation

URL: http://arxiv.org/abs/2309.03903v1
Date: Thu, 7 Sep 2023 17:59:41 GMT
Title: Tracking Anything with Decoupled Video Segmentation
Authors: Ho Kei Cheng, Seoung Wug Oh, Brian Price, Alexander Schwing, Joon-Young Lee
Abstract summary: We develop a decoupled video segmentation approach (DEVA) It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
Score: 87.07258378407289
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA

Related papers

Parameter-free Video Segmentation for Vision and Language Understanding [55.20132267309382]
We propose an algorithm for segmenting videos into contiguous chunks, based on the minimum description length principle. The algorithm is entirely parameter-free, given feature vectors, not requiring a set threshold or the number or size of chunks to be specified.
arXiv Detail & Related papers (2025-03-03T05:54:37Z)
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models [37.43195217391341]
In this paper, we define the union of segmentation and reasoning segmentation at both the image and video levels as Instructed Visual (IVS) Specifically, we employ an object-aware video perceiver to extract temporal and object information from reference frames, facilitating comprehensive video understanding. By leveraging multi-task and end-to-end training, InstructSeg demonstrates superior performance across diverse image and video segmentation tasks.
arXiv Detail & Related papers (2024-12-18T16:20:40Z)
ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation [14.534308478766476]
We introduce ViCaS, a new dataset containing thousands of challenging videos.<n>Our benchmark evaluates models on holistic/high-level understanding and language-guided, pixel-precise segmentation.
arXiv Detail & Related papers (2024-12-12T23:10:54Z)
Multi-Granularity Video Object Segmentation [36.06127939037613]
We propose a large-scale, densely annotated multi-granularity video object segmentation (MUG-VOS) dataset. We automatically collected a training set that assists in tracking both salient and non-salient objects, and we also curated a human-annotated test set for reliable evaluation. In addition, we present memory-based mask propagation model (MMPM), trained and evaluated on MUG-VOS dataset.
arXiv Detail & Related papers (2024-12-02T13:17:41Z)
ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation. The task is designed to output tracklets of segmentation masks given a complex input text query. We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs. We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z)
DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
We present OV-DVIS++, the first open-vocabulary universal video segmentation framework. By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z)
Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence. We introduce a simple end-to-end train bottomable-up approach to achieve instance mask predictions at the pixel-level granularity, instead of the typical region-proposals-based approach. Our method provides competitive results on YouTube-VIS and DAVIS-19 datasets, and has minimum run-time compared to other contemporary state-of-the-art performance methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance aims to simultaneously classify, segment, and track multiple object instances in videos. Most clip-level methods are neither end-to-end learnable nor real-time. This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
Merging Tasks for Video Panoptic Segmentation [0.0]
Video panoptic segmentation (VPS) is a recently introduced computer vision task that requires classifying and tracking every pixel in a given video. To understand video panoptic segmentation, first, earlier introduced constituent tasks that focus on semantics and tracking separately will be researched. Two data-driven approaches which do not require training on a tailored dataset will be selected to solve it.
arXiv Detail & Related papers (2021-07-10T08:46:42Z)
Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos. We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training. Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos [17.232631075144592]
Methods for instance segmentation in videos typically follow the tracking-by-detection paradigm. We propose a novel approach that segments and tracks instances across space and time in a single stage. Our method achieves state-of-the-art results across multiple datasets and tasks.
arXiv Detail & Related papers (2020-03-18T18:40:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.