Weakly Supervised Video Salient Object Detection via Point Supervision
- URL: http://arxiv.org/abs/2207.07269v1
- Date: Fri, 15 Jul 2022 03:31:15 GMT
- Title: Weakly Supervised Video Salient Object Detection via Point Supervision
- Authors: Shuyong Gao, Haozhe Xing, Wei Zhang, Yan Wang, Qianyu Guo, Wenqiang
Zhang
- Abstract summary: We propose a strong baseline model based on point supervision.
To infer saliency maps with temporal information, we mine inter-frame complementary information from short-term and long-term perspectives.
We label two point-supervised datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and the DAVSOD dataset.
- Score: 18.952253968878356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video salient object detection models trained on pixel-wise dense annotation
have achieved excellent performance, yet obtaining pixel-by-pixel annotated
datasets is laborious. Several works attempt to use scribble annotations to
mitigate this problem, but point supervision, a more labor-saving annotation
method (indeed the most labor-saving among manual annotation methods for
dense prediction), has not been explored. In this paper, we propose a strong
baseline model based on point supervision. To infer saliency maps with temporal
information, we mine inter-frame complementary information from short-term and
long-term perspectives, respectively. Specifically, we propose a hybrid token
attention module, which mixes optical flow and image information from
orthogonal directions, adaptively highlighting critical optical flow
information (channel dimension) and critical token information (spatial
dimension). To exploit long-term cues, we develop the Long-term Cross-Frame
Attention module (LCFA), which assists the current frame in inferring salient
objects based on multi-frame tokens. Furthermore, we label two point-supervised
datasets, P-DAVIS and P-DAVSOD, by relabeling the DAVIS and the DAVSOD dataset.
Experiments on six benchmark datasets show that our method outperforms
previous state-of-the-art weakly supervised methods and is even comparable with
some fully supervised approaches. Source code and datasets are available.
Related papers
- LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry [52.131996528655094]
We present the Long-term Effective Any Point Tracking (LEAP) module.
LEAP innovatively combines visual, inter-track, and temporal cues with mindfully selected anchors for dynamic track estimation.
Based on these traits, we develop LEAP-VO, a robust visual odometry system adept at handling occlusions and dynamic scenes.
arXiv Detail & Related papers (2024-01-03T18:57:27Z)
- SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations [12.139451002212063]
SSVOD exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations.
Our method achieves significant performance improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS.
arXiv Detail & Related papers (2023-09-04T06:41:33Z)
- Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments to effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass point-, voxel-, and multi-view-fusion counterparts on competitive LiDAR semantic and panoptic segmentation benchmarks.
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
- TSANet: Temporal and Scale Alignment for Unsupervised Video Object Segmentation [21.19216164433897]
Unsupervised Video Object Segmentation (UVOS) refers to the challenging task of segmenting the prominent object in videos without manual guidance.
We propose a novel framework for UVOS that can address the aforementioned limitations of the two approaches.
We present experimental results on public benchmark datasets, DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-03-08T04:59:43Z)
- 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information employing multiple frames to detect objects and track them in a single network.
arXiv Detail & Related papers (2022-11-01T20:59:38Z)
- Video Shadow Detection via Spatio-Temporal Interpolation Consistency Training [31.115226660100294]
We propose a framework to feed the unlabeled video frames together with the labeled images into an image shadow detection network training.
We then derive the spatial and temporal consistency constraints accordingly for enhancing generalization in the pixel-wise classification.
In addition, we design a Scale-Aware Network for multi-scale shadow knowledge learning in images.
arXiv Detail & Related papers (2022-06-17T14:29:51Z)
- Learning Video Salient Object Detection Progressively from Unlabeled Videos [8.224670666756193]
We propose a novel VSOD method via a progressive framework that locates and segments salient objects in sequence without utilizing any video annotation.
Specifically, an algorithm for generating deep temporal location labels, which consists of generating high-saliency location labels and tracking salient objects in adjacent frames, is proposed.
Although our method does not require labeled video at all, the experimental results on five public benchmarks of DAVIS, FBMS, ViSal, VOS, and DAVSOD demonstrate that our proposed method is competitive with fully supervised methods and outperforms the state-of-the-art weakly and unsupervised methods.
arXiv Detail & Related papers (2022-04-05T06:12:45Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Weakly Supervised Video Salient Object Detection [79.51227350937721]
We present the first weakly supervised video salient object detection model based on relabeled "fixation guided scribble annotations".
An "Appearance-motion fusion module" and bidirectional ConvLSTM based framework are proposed to achieve effective multi-modal learning and long-term temporal context modeling.
arXiv Detail & Related papers (2021-04-06T09:48:38Z)
- Densely Nested Top-Down Flows for Salient Object Detection [137.74130900326833]
This paper revisits the role of top-down modeling in salient object detection.
It designs a novel densely nested top-down flows (DNTDF)-based framework.
In every stage of DNTDF, features from higher levels are read in via the progressive compression shortcut paths (PCSP).
arXiv Detail & Related papers (2021-02-18T03:14:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.