Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised
Framework with Spatio-Temporal Collaboration
- URL: http://arxiv.org/abs/2212.07592v1
- Date: Thu, 15 Dec 2022 02:44:13 GMT
- Title: Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised
Framework with Spatio-Temporal Collaboration
- Authors: Liqi Yan, Qifan Wang, Siqi Ma, Jingang Wang, Changbin Yu
- Abstract summary: We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
- Score: 13.284951215948052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instance segmentation in videos, which aims to segment and track multiple
objects in video frames, has garnered a flurry of research attention in recent
years. In this paper, we present a novel weakly supervised framework with
\textbf{S}patio-\textbf{T}emporal \textbf{C}ollaboration for instance
\textbf{Seg}mentation in videos, namely \textbf{STC-Seg}. Concretely, STC-Seg
demonstrates four contributions. First, we leverage the complementary
representations from unsupervised depth estimation and optical flow to produce
effective pseudo-labels for training deep networks and predicting high-quality
instance masks. Second, to enhance the mask generation, we devise a puzzle
loss, which enables end-to-end training using box-level annotations. Third, our
tracking module jointly utilizes bounding-box diagonal points with
spatio-temporal discrepancy to model movements, which largely improves the
robustness to different object appearances. Finally, our framework is flexible
and enables image-level instance segmentation methods to operate on the
video-level task. We conduct an extensive set of experiments on the KITTI MOTS
and YT-VIS datasets. Experimental results demonstrate that our method achieves
strong performance and even outperforms fully supervised TrackR-CNN and
MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the
community, as it reveals only the tip of the iceberg of innovative
opportunities in the weakly supervised paradigm for instance segmentation in
videos.
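The first two contributions describe how unsupervised depth and optical flow are fused into pseudo-labels under box-level supervision. As a rough illustration of that general idea only, the minimal Python sketch below builds a binary pseudo-mask inside a ground-truth box by keeping pixels whose depth and motion agree with the box's median cues; the thresholding heuristic and all names (`pseudo_mask_from_cues`, `depth_tol`, `flow_tol`) are illustrative assumptions and do not reproduce the paper's actual puzzle loss or networks.

```python
import numpy as np

def pseudo_mask_from_cues(depth, flow, box, depth_tol=0.1, flow_tol=1.0):
    """Fuse depth and optical-flow cues into a binary pseudo-mask inside a box.

    depth: (H, W) monocular depth estimate for the frame.
    flow:  (H, W, 2) optical flow to the next frame.
    box:   (x1, y1, x2, y2) ground-truth bounding box in pixel coordinates.

    Heuristic: pixels inside the box whose depth and motion are close to the
    box's median depth/motion are treated as foreground. This illustrates the
    "complementary cues" idea only; it is not the STC-Seg algorithm.
    """
    x1, y1, x2, y2 = box
    mask = np.zeros(depth.shape, dtype=bool)

    d_crop = depth[y1:y2, x1:x2]
    f_crop = flow[y1:y2, x1:x2]

    d_med = np.median(d_crop)                          # typical object depth
    f_med = np.median(f_crop.reshape(-1, 2), axis=0)   # typical object motion

    depth_ok = np.abs(d_crop - d_med) < depth_tol * max(d_med, 1e-6)
    flow_ok = np.linalg.norm(f_crop - f_med, axis=-1) < flow_tol

    mask[y1:y2, x1:x2] = depth_ok & flow_ok            # both cues must agree
    return mask


if __name__ == "__main__":
    H, W = 120, 160
    depth = np.random.rand(H, W) + 5.0
    flow = np.random.randn(H, W, 2)
    m = pseudo_mask_from_cues(depth, flow, box=(40, 30, 100, 90))
    print("pseudo-mask pixels:", int(m.sum()))
```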
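For the third contribution, the abstract describes associating instances across frames using the diagonal points of bounding boxes together with a spatio-temporal discrepancy term. The sketch below shows one hypothetical way such an association cost and a greedy matcher could look; the cost definition, the use of a per-box mean flow magnitude as the temporal term, and all names are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def diagonal_point_cost(box_a, box_b, flow_mag_a, flow_mag_b, alpha=1.0):
    """Matching cost between a detection in frame t and one in frame t+1.

    Uses the two diagonal corner points of each box plus a simple temporal
    term (difference in mean flow magnitude) as a stand-in for a
    spatio-temporal discrepancy. Purely illustrative; not the STC-Seg cost.
    """
    pa = np.array([[box_a[0], box_a[1]], [box_a[2], box_a[3]]], dtype=float)
    pb = np.array([[box_b[0], box_b[1]], [box_b[2], box_b[3]]], dtype=float)
    spatial = np.linalg.norm(pa - pb, axis=1).mean()   # corner displacement
    temporal = abs(flow_mag_a - flow_mag_b)            # motion discrepancy
    return spatial + alpha * temporal


def match_tracks(boxes_t, boxes_t1, flows_t, flows_t1):
    """Greedy association of detections across two frames by lowest cost."""
    if not boxes_t1:
        return []
    pairs, used = [], set()
    for i, (ba, fa) in enumerate(zip(boxes_t, flows_t)):
        costs = [
            diagonal_point_cost(ba, bb, fa, fb) if j not in used else np.inf
            for j, (bb, fb) in enumerate(zip(boxes_t1, flows_t1))
        ]
        j = int(np.argmin(costs))
        if np.isfinite(costs[j]):
            pairs.append((i, j))
            used.add(j)
    return pairs


if __name__ == "__main__":
    boxes_t = [(10, 10, 50, 60), (80, 20, 120, 70)]
    boxes_t1 = [(82, 22, 123, 72), (12, 12, 52, 63)]
    # Expected association: (0 -> 1) and (1 -> 0).
    print(match_tracks(boxes_t, boxes_t1, flows_t=[2.0, 5.0], flows_t1=[5.1, 2.2]))
```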
Related papers
- What is Point Supervision Worth in Video Instance Segmentation? [119.71921319637748]
Video instance segmentation (VIS) is a challenging vision task that aims to detect, segment, and track objects in videos.
We reduce the human annotations to only one point for each object in a video frame during training, and obtain high-quality mask predictions close to fully supervised models.
Comprehensive experiments on three VIS benchmarks demonstrate competitive performance of the proposed framework, nearly matching fully supervised methods.
arXiv Detail & Related papers (2024-04-01T17:38:25Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Learning Referring Video Object Segmentation from Weak Annotation [78.45828085350936]
Referring video object segmentation (RVOS) is a task that aims to segment the target object in all video frames based on a sentence describing the object.
We propose a new annotation scheme that reduces the annotation effort by 8 times, while providing sufficient supervision for RVOS.
Our scheme only requires a mask for the frame where the object first appears and bounding boxes for the rest of the frames.
arXiv Detail & Related papers (2023-08-04T06:50:52Z)
- Video Instance Segmentation by Instance Flow Assembly [23.001856276175506]
Bottom-up methods dealing with box-free features can offer accurate spatial correlations across frames.
We propose our framework equipped with a temporal context fusion module to better encode inter-frame correlations.
Experiments demonstrate that the proposed method outperforms state-of-the-art online methods (taking image-level input) on the challenging YouTube-VIS dataset.
arXiv Detail & Related papers (2021-10-20T14:49:28Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- SG-Net: Spatial Granularity Network for One-Stage Video Instance Segmentation [7.544917072241684]
Video instance segmentation (VIS) is a new and critical task in computer vision.
We propose a one-stage spatial granularity network (SG-Net) for VIS.
We show that our method can achieve improved performance in both accuracy and inference speed.
arXiv Detail & Related papers (2021-03-18T14:31:15Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.