STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
- URL: http://arxiv.org/abs/2003.08429v4
- Date: Fri, 1 Sep 2023 13:25:14 GMT
- Title: STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
- Authors: Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, Bastian Leibe
- Abstract summary: Methods for instance segmentation in videos typically follow the tracking-by-detection paradigm.
We propose a novel approach that segments and tracks instances across space and time in a single stage.
Our method achieves state-of-the-art results across multiple datasets and tasks.
- Score: 17.232631075144592
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for instance segmentation in videos typically involve
multi-stage pipelines that follow the tracking-by-detection paradigm and model
a video clip as a sequence of images. Multiple networks are used to detect
objects in individual frames, and then associate these detections over time.
Hence, these methods are often non-end-to-end trainable and highly tailored to
specific tasks. In this paper, we propose a different approach that is
well-suited to a variety of tasks involving instance segmentation in videos. In
particular, we model a video clip as a single 3D spatio-temporal volume, and
propose a novel approach that segments and tracks instances across space and
time in a single stage. Our problem formulation is centered around the idea of
spatio-temporal embeddings which are trained to cluster pixels belonging to a
specific object instance over an entire video clip. To this end, we introduce
(i) novel mixing functions that enhance the feature representation of
spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that
can reason about temporal context. Our network is trained end-to-end to learn
spatio-temporal embeddings as well as parameters required to cluster these
embeddings, thus simplifying inference. Our method achieves state-of-the-art
results across multiple datasets and tasks. Code and models are available at
https://github.com/sabarim/STEm-Seg.
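To make the clustering formulation concrete, here is a minimal sketch of how instance tubes could be grown greedily from per-pixel embeddings over a T x H x W volume, assuming the network also predicts a per-pixel "seediness" score and bandwidth, as in proposal-free embedding approaches. The function name, thresholds, and the greedy loop are illustrative assumptions; the paper's mixing functions and learned clustering parameters are not reproduced here.

```python
# Illustrative sketch: cluster spatio-temporal embeddings into instance tubes.
# Not the authors' implementation; all names and thresholds are assumptions.
import numpy as np

def cluster_embeddings(emb, seediness, sigma, seed_thresh=0.5, fg_thresh=0.5):
    """emb: (T, H, W, D) per-pixel embeddings; seediness: (T, H, W) in [0, 1];
    sigma: (T, H, W) per-pixel bandwidth. Returns (T, H, W) instance ids."""
    T, H, W, D = emb.shape
    flat_emb = emb.reshape(-1, D)
    flat_seed = seediness.reshape(-1).copy()
    flat_sigma = sigma.reshape(-1)
    labels = np.zeros(T * H * W, dtype=np.int32)  # 0 = background
    instance_id = 0
    while flat_seed.max() > seed_thresh:
        instance_id += 1
        s = flat_seed.argmax()  # most confident unclaimed pixel acts as seed
        # Gaussian affinity of every pixel to the seed embedding.
        d2 = ((flat_emb - flat_emb[s]) ** 2).sum(axis=1)
        prob = np.exp(-d2 / (2.0 * flat_sigma[s] ** 2 + 1e-8))
        mask = (prob > fg_thresh) & (labels == 0)
        labels[mask] = instance_id  # one tube spanning all frames at once
        flat_seed[mask] = 0.0       # suppress claimed pixels
    return labels.reshape(T, H, W)

# Toy usage with random maps; a real model would predict all three tensors.
rng = np.random.default_rng(0)
T, H, W, D = 4, 8, 8, 16
ids = cluster_embeddings(rng.random((T, H, W, D)),
                         rng.random((T, H, W)),
                         0.5 + rng.random((T, H, W)))
print(ids.shape, ids.max())
```

Because pixels are clustered over the whole clip rather than frame by frame, segmentation and tracking fall out of a single step, which is what removes the separate association stage of tracking-by-detection pipelines.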
Related papers
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
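As a rough illustration of the kernel temporal segmentation idea in the entry above, the sketch below partitions per-frame features into a fixed number of variable-length segments by minimizing within-segment scatter with dynamic programming. This linear-kernel simplification and all names are assumptions, not the paper's code.

```python
# Hedged sketch of KTS-style change-point detection as an adaptive tokenizer.
import numpy as np

def segment_costs(feats):
    """cost[i][j] = within-segment scatter of frames i..j (inclusive)."""
    n = len(feats)
    csum = np.cumsum(feats, axis=0)             # prefix sums of features
    csq = np.cumsum((feats ** 2).sum(axis=1))   # prefix sums of ||x||^2
    cost = np.full((n, n), np.inf)
    for i in range(n):
        for j in range(i, n):
            m = j - i + 1
            s = csum[j] - (csum[i - 1] if i > 0 else 0)
            sq = csq[j] - (csq[i - 1] if i > 0 else 0)
            cost[i, j] = sq - (s @ s) / m       # sum of ||x - segment mean||^2
    return cost

def kts(feats, n_segments):
    """Return optimal segment start indices via dynamic programming."""
    n = len(feats)
    cost = segment_costs(feats)
    dp = np.full((n_segments + 1, n + 1), np.inf)
    back = np.zeros((n_segments + 1, n + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = dp[k - 1, i] + cost[i, j - 1]
                if c < dp[k, j]:
                    dp[k, j], back[k, j] = c, i
    bounds, j = [], n
    for k in range(n_segments, 0, -1):
        j = back[k, j]
        bounds.append(j)
    return sorted(bounds)  # includes 0 as the first segment start

# Three block-constant "scenes" of lengths 5, 7, 4 should be recovered.
feats = np.concatenate([np.zeros((5, 8)), np.ones((7, 8)), 2 * np.ones((4, 8))])
print(kts(feats, 3))  # expected segment starts: [0, 5, 12]
```

Segments found this way track semantic changes in the video, so downstream clips need not be redundant fixed-length windows.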
- Tracking Anything with Decoupled Video Segmentation [87.07258378407289]
We develop a decoupled video segmentation approach (DEVA).
It is composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation.
We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks.
arXiv Detail & Related papers (2023-09-07T17:59:41Z)
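The decoupling in the DEVA entry above can be illustrated with a toy merge step: masks carried forward by a propagation module are reconciled with fresh image-level detections by mask IoU. The matching rule, the threshold, and the stand-in `propagate` role are assumptions, not the paper's design.

```python
# Toy reconciliation of propagated instance masks with new detections.
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def merge(propagated, detected, iou_thresh=0.5):
    """propagated: {instance_id: bool mask} carried from past frames (by a
    learned propagation network, hypothetical here); detected: list of bool
    masks from the image-level model. Returns updated {instance_id: mask}."""
    merged, used = dict(propagated), set()
    next_id = max(propagated, default=0) + 1
    for det in detected:
        match = max(propagated, default=None,
                    key=lambda i: mask_iou(propagated[i], det))
        if (match is not None and match not in used
                and mask_iou(propagated[match], det) >= iou_thresh):
            merged[match] = det   # refresh an existing track with the detection
            used.add(match)
        else:
            merged[next_id] = det  # unmatched detection starts a new track
            next_id += 1
    return merged

prop = {1: np.array([[1, 1], [0, 0]], bool)}
det = [np.array([[1, 1], [1, 0]], bool), np.array([[0, 0], [0, 1]], bool)]
print(sorted(merge(prop, det)))  # -> [1, 2]: track 1 refreshed, track 2 spawned
```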
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
We achieve state-of-the-art results on all three localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
- TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
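A minimal sketch of the final clustering-and-decoding stage described in the TAEC entry above, assuming time-aware frame embeddings are already available. Plain k-means and run-length decoding stand in for the paper's temporal embedding network and decoding; all names are illustrative.

```python
# Cluster time-aware frame features, then decode contiguous temporal segments.
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = x[labels == c].mean(0)
    return labels

def decode_segments(labels):
    """Collapse per-frame labels into (start, end, cluster) runs."""
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t - 1, int(labels[start])))
            start = t
    return segments

T, d = 30, 16
feats = np.random.default_rng(1).normal(size=(T, d))
feats[:10] += 5.0                    # three synthetic "actions" in sequence
feats[20:] -= 5.0
time = (np.arange(T) / T)[:, None]   # appended time feature, an assumption
labels = kmeans(np.hstack([feats, time]), k=3)
print(decode_segments(labels))       # roughly three contiguous segments
```

Appending a normalized timestamp is one simple way to make the clustering temporally aware, so that frames close in time are more likely to land in the same segment.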
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple, end-to-end trainable bottom-up approach that achieves instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and runs faster than other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baselines that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z)
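The self-shot setting in the entry above can be caricatured as retrieval: given a query video, the most similar videos in an unlabeled pool are selected automatically as supports. Cosine similarity over precomputed video-level embeddings is an assumption here, not the paper's retrieval criterion.

```python
# Sketch: retrieve support videos for a query by embedding similarity.
import numpy as np

def retrieve_supports(query_emb, pool_embs, k=5):
    """query_emb: (d,); pool_embs: (N, d). Returns indices of top-k supports."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                     # cosine similarity to every pool video
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(2)
pool = rng.normal(size=(100, 64))
query = pool[7] + 0.1 * rng.normal(size=64)   # query resembles pool item 7
print(retrieve_supports(query, pool, k=3))    # item 7 should rank first
```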
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment, and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
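In the spirit of the routing described in the TPR entry above, this loose PyTorch sketch aligns the previous frame's pyramid features (a plain convolution stands in for learned alignment) and mixes them into the current frame through a predicted per-pixel gate. All module and layer choices are assumptions, not the paper's architecture.

```python
# Sketch: gated pixel-level aggregation across two adjacent frames' pyramids.
import torch
import torch.nn as nn

class PyramidRouting(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.align = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, cur_pyramid, prev_pyramid):
        fused = []
        for cur, prev in zip(cur_pyramid, prev_pyramid):
            aligned = self.align(prev)                   # align prev features
            g = self.gate(torch.cat([cur, aligned], 1))  # per-pixel routing
            fused.append(cur + g * aligned)              # conditional mix-in
        return fused

# Plug-and-play over a 3-level pyramid with a shared channel width (assumed).
cur = [torch.randn(1, 64, s, s) for s in (32, 16, 8)]
prev = [torch.randn(1, 64, s, s) for s in (32, 16, 8)]
out = PyramidRouting(64)(cur, prev)
print([o.shape for o in out])
```

Because the module only consumes two adjacent feature pyramids and returns fused ones of the same shapes, it can sit between the backbone and the head of an existing instance segmentation network, matching the plug-and-play claim.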
- Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences [25.299599341774204]
This paper proposes an approach for the unsupervised learning of actions in untrimmed video sequences based on a joint visual-temporal embedding space.
We show that the proposed approach is able to provide a meaningful visual and temporal embedding out of the visual cues present in contiguous video frames.
arXiv Detail & Related papers (2020-01-29T22:51:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.