Tube-Link: A Flexible Cross Tube Framework for Universal Video
Segmentation
- URL: http://arxiv.org/abs/2303.12782v3
- Date: Mon, 21 Aug 2023 12:46:09 GMT
- Title: Tube-Link: A Flexible Cross Tube Framework for Universal Video
Segmentation
- Authors: Xiangtai Li, Haobo Yuan, Wenwei Zhang, Guangliang Cheng, Jiangmiao
Pang, Chen Change Loy
- Abstract summary: Tube-Link is a versatile framework that addresses multiple core tasks of video segmentation with a unified architecture.
Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks.
- Score: 83.65774845267622
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video segmentation aims to segment and track every pixel in diverse scenarios
accurately. In this paper, we present Tube-Link, a versatile framework that
addresses multiple core tasks of video segmentation with a unified
architecture. Our framework is a near-online approach that takes a short
subclip as input and outputs the corresponding spatial-temporal tube masks. To
enhance the modeling of cross-tube relationships, we propose an effective way
to perform tube-level linking via attention along the queries. In addition, we
introduce temporal contrastive learning to learn instance-wise discriminative
features for tube-level association. Our approach offers flexibility and
efficiency for both short and long video inputs, as the length of each subclip
can be varied according to the needs of datasets or scenarios. Tube-Link
outperforms existing specialized architectures by a significant margin on five
video segmentation datasets. Specifically, it achieves almost 13% relative
improvements on VIPSeg and 4% improvements on KITTI-STEP over the strong
baseline Video K-Net. When using a ResNet50 backbone on YouTube-VIS 2019 and
2021, Tube-Link boosts IDOL by 3% and 4%, respectively.
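To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: a cross-attention layer that links the current subclip's tube queries to the previous subclip's queries, and an InfoNCE-style temporal contrastive loss for tube-level association. All module names, tensor shapes, and the assumption that matching queries share the same index are illustrative.

```python
# Hedged sketch of the two ideas stated in the abstract (not the official code):
# (1) tube-level linking via attention along the queries of adjacent subclips,
# (2) temporal contrastive learning for instance-wise discriminative features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TubeQueryLinker(nn.Module):
    """Cross-attend the current tube's queries to the previous tube's queries."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, curr_queries: torch.Tensor, prev_queries: torch.Tensor) -> torch.Tensor:
        # curr_queries, prev_queries: (batch, num_queries, dim)
        linked, _ = self.attn(query=curr_queries, key=prev_queries, value=prev_queries)
        # Residual connection so the current tube's own predictions are preserved.
        return self.norm(curr_queries + linked)


def temporal_contrastive_loss(curr_emb: torch.Tensor,
                              prev_emb: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over tube embeddings: query i of the current tube should be most
    similar to query i of the previous tube (index-aligned matching is assumed)."""
    curr = F.normalize(curr_emb, dim=-1)      # (num_queries, dim)
    prev = F.normalize(prev_emb, dim=-1)      # (num_queries, dim)
    logits = curr @ prev.t() / temperature    # pairwise cosine similarities
    targets = torch.arange(curr.size(0), device=curr.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    linker = TubeQueryLinker()
    prev_q = torch.randn(2, 100, 256)   # queries decoded from the previous subclip
    curr_q = torch.randn(2, 100, 256)   # queries decoded from the current subclip
    linked_q = linker(curr_q, prev_q)   # (2, 100, 256)
    loss = temporal_contrastive_loss(linked_q[0], prev_q[0])
    print(linked_q.shape, loss.item())
```

In the paper, the subclip length, the sampling of positives and negatives for the contrastive term, and how linked queries feed the mask decoder follow the authors' design; this sketch only illustrates the shape of the two operations.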
Related papers
- Making Every Frame Matter: Continuous Video Understanding for Large Models via Adaptive State Modeling [14.450847211200292]
Video understanding has become increasingly important with the rise of multi-modality applications.
We introduce a novel system, C-VUE, to overcome these issues through adaptive state modeling.
C-VUE has three key designs. The first is a long-range history modeling technique that uses a video-aware approach to retain historical video information.
The second is a spatial redundancy reduction technique, which enhances the efficiency of history modeling based on temporal relations.
arXiv Detail & Related papers (2024-10-19T05:50:00Z)
- Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets [62.280729345770936]
We introduce the task of Alignable Video Retrieval (AVR).
Given a query video, our approach can identify well-alignable videos from a large collection of clips and temporally synchronize them to the query.
Our experiments on 3 datasets, including large-scale Kinetics700, demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-02T20:00:49Z)
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy, Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- DVIS++: Improved Decoupled Framework for Universal Video Segmentation [30.703276476607545]
By integrating CLIP with DVIS++, we present OV-DVIS++, the first open-vocabulary universal video segmentation framework.
arXiv Detail & Related papers (2023-12-20T03:01:33Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion.
We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- In Defense of Clip-based Video Relation Detection [32.05021939177942]
Video Visual Relation Detection (VidVRD) aims to detect visual relationship triplets in videos using spatial bounding boxes and temporal boundaries.
We propose a Hierarchical Context Model (HCM) that enriches the object-based spatial context and relation-based temporal context based on clips.
Our HCM achieves a new state-of-the-art performance, highlighting the effectiveness of incorporating advanced spatial and temporal context modeling within the clip-based paradigm.
arXiv Detail & Related papers (2023-07-18T05:42:01Z)
- Efficient Video Instance Segmentation via Tracklet Query and Proposal [62.897552852894854]
Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos.
Most clip-level methods are neither end-to-end learnable nor real-time.
This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference.
arXiv Detail & Related papers (2022-03-03T17:00:11Z)
- STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation [47.28515170195206]
Video Instance Segmentation (VIS) is a task that simultaneously requires classification, segmentation, and instance association in a video.
Recent VIS approaches rely on sophisticated pipelines to achieve this goal, including RoI-related operations or 3D convolutions.
We present a simple and efficient single-stage VIS framework based on the instance segmentation method CondInst.
arXiv Detail & Related papers (2022-02-08T09:34:26Z)
- Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment, and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z)
- Few-Shot Video Object Detection [70.43402912344327]
We introduce Few-Shot Video Object Detection (FSVOD) with three important contributions.
FSVOD-500 comprises 500 classes with class-balanced videos in each category for few-shot learning.
Our TPN and TMN+ are jointly and end-to-end trained.
arXiv Detail & Related papers (2021-04-30T07:38:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.