F2Net: Learning to Focus on the Foreground for Unsupervised Video Object
Segmentation
- URL: http://arxiv.org/abs/2012.02534v1
- Date: Fri, 4 Dec 2020 11:30:50 GMT
- Title: F2Net: Learning to Focus on the Foreground for Unsupervised Video Object
Segmentation
- Authors: Daizong Liu, Dongdong Yu, Changhu Wang, Pan Zhou
- Abstract summary: We propose a novel Focus on Foreground Network (F2Net), which delves into the intra-inter frame details for the foreground objects.
Our proposed network consists of three main parts: Siamese Encoder Module, Center Guiding Appearance Diffusion Module, and Dynamic Information Fusion Module.
Experiments on the DAVIS2016, YouTube-Objects, and FBMS datasets show that our proposed F2Net achieves state-of-the-art performance with significant improvements.
- Score: 61.74261802856947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although deep learning based methods have achieved great progress in
unsupervised video object segmentation, difficult scenarios (e.g., visual
similarity, occlusions, and appearance changing) are still not well-handled. To
alleviate these issues, we propose a novel Focus on Foreground Network (F2Net),
which delves into the intra-inter frame details for the foreground objects and
thus effectively improves the segmentation performance. Specifically, our
proposed network consists of three main parts: Siamese Encoder Module, Center
Guiding Appearance Diffusion Module, and Dynamic Information Fusion Module.
Firstly, we take a siamese encoder to extract the feature representations of
paired frames (the reference frame and the current frame). Then, a Center
Guiding Appearance Diffusion Module is designed to capture the inter-frame
feature (dense correspondences between the reference frame and the current
frame), the intra-frame feature (dense correspondences within the current
frame), and the original semantic feature of the current frame. Specifically,
we establish a Center Prediction Branch to predict the center location of the
foreground object in the current frame and leverage the center point
information as a spatial prior to guide the inter-frame and intra-frame feature
extraction, so that the feature representations focus considerably on the
foreground objects. Finally, we
propose a Dynamic Information Fusion Module to automatically select the
relatively important features among the three aforementioned feature levels.
Extensive experiments on the DAVIS2016, YouTube-Objects, and FBMS datasets show
that our proposed F2Net achieves state-of-the-art performance with significant
improvements.
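As a reading aid, below is a minimal PyTorch-style sketch of the pipeline the abstract describes: a shared (siamese) encoder over the reference and current frames, a center-guided module that builds inter-frame and intra-frame features from dense correspondences, and a gated fusion of the three feature levels. The layer sizes, the attention formulation, and the gating scheme are assumptions made for illustration, not the authors' exact design.

```python
# Minimal sketch of the F2Net pipeline described in the abstract.
# Module names follow the paper; every layer size, the attention formulation,
# and the fusion gating are illustrative assumptions, not the released model.
import torch
import torch.nn as nn


class CenterGuidedDiffusion(nn.Module):
    """Builds inter-frame and intra-frame features from dense correspondences,
    modulated by a predicted foreground-center prior (assumed to be a heatmap)."""

    def __init__(self, channels: int):
        super().__init__()
        self.center_head = nn.Conv2d(channels, 1, kernel_size=1)  # Center Prediction Branch (assumed 1x1 head)

    def forward(self, feat_ref, feat_cur):
        b, c, h, w = feat_cur.shape
        center_prior = torch.sigmoid(self.center_head(feat_cur))      # (B,1,H,W) spatial prior
        cur = (feat_cur * center_prior).flatten(2)                    # (B,C,HW), center-weighted query
        ref = feat_ref.flatten(2)                                     # (B,C,HW)

        # Inter-frame feature: dense correspondences reference -> current frame.
        inter_att = torch.softmax(cur.transpose(1, 2) @ ref / c ** 0.5, dim=-1)   # (B,HW,HW)
        inter_feat = (inter_att @ ref.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)

        # Intra-frame feature: dense correspondences within the current frame.
        intra_att = torch.softmax(cur.transpose(1, 2) @ cur / c ** 0.5, dim=-1)
        intra_feat = (intra_att @ cur.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)

        return inter_feat, intra_feat, feat_cur, center_prior


class DynamicInformationFusion(nn.Module):
    """Predicts per-level gates and fuses the three feature levels (assumed softmax gating)."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(3 * channels, 3, kernel_size=1)

    def forward(self, inter_feat, intra_feat, sem_feat):
        gates = torch.softmax(self.gate(torch.cat([inter_feat, intra_feat, sem_feat], dim=1)), dim=1)
        return gates[:, 0:1] * inter_feat + gates[:, 1:2] * intra_feat + gates[:, 2:3] * sem_feat


class F2NetSketch(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # Siamese Encoder Module: one shared backbone applied to both frames
        # (a tiny conv stack stands in for the real backbone here).
        self.encoder = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.diffusion = CenterGuidedDiffusion(channels)
        self.fusion = DynamicInformationFusion(channels)
        self.seg_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, ref_frame, cur_frame):
        feat_ref, feat_cur = self.encoder(ref_frame), self.encoder(cur_frame)  # shared weights
        inter_f, intra_f, sem_f, center = self.diffusion(feat_ref, feat_cur)
        mask_logits = self.seg_head(self.fusion(inter_f, intra_f, sem_f))
        return mask_logits, center


if __name__ == "__main__":
    model = F2NetSketch()
    ref, cur = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    mask, center = model(ref, cur)
    print(mask.shape, center.shape)  # both torch.Size([1, 1, 64, 64])
```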
Related papers
- PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - STF: Spatio-Temporal Fusion Module for Improving Video Object Detection [7.213855322671065]
Consecutive frames in a video contain redundancy, but they may also contain complementary information for the detection task.
We propose a spatio-temporal fusion framework (STF) to leverage this complementary information (a generic sketch of this fusion idea appears after the list below).
The proposed spatio-temporal fusion module leads to improved detection performance compared to baseline object detectors.
arXiv Detail & Related papers (2024-02-16T15:19:39Z) - Rethinking Amodal Video Segmentation from Learning Supervised Signals
with Object-centric Representation [47.39455910191075]
Video amodal segmentation is a challenging task in computer vision.
Recent studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting.
This paper presents a rethinking of previous works, particularly leveraging supervised signals with an object-centric representation.
arXiv Detail & Related papers (2023-09-23T04:12:02Z) - FoV-Net: Field-of-View Extrapolation Using Self-Attention and
Uncertainty [95.11806655550315]
We utilize information from a video sequence with a narrow field-of-view to infer the scene at a wider field-of-view.
We propose a temporally consistent field-of-view extrapolation framework, namely FoV-Net.
Experiments show that FoV-Net extrapolates the temporally consistent wide field-of-view scene better than existing alternatives.
arXiv Detail & Related papers (2022-04-04T06:24:03Z) - Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The Full-Duplex Strategy Network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
arXiv Detail & Related papers (2021-08-06T14:50:50Z) - EA-Net: Edge-Aware Network for Flow-based Video Frame Interpolation [101.75999290175412]
We propose to reduce image blur and obtain clear object shapes by preserving the edges in the interpolated frames.
The proposed Edge-Aware Network (EA-Net) integrates the edge information into the frame interpolation task.
Three edge-aware mechanisms are developed to emphasize the frame edges in estimating flow maps.
arXiv Detail & Related papers (2021-05-17T08:44:34Z) - (AF)2-S3Net: Attentive Feature Fusion with Adaptive Feature Selection
for Sparse Semantic Segmentation Network [3.6967381030744515]
We propose AF2-S3Net, an end-to-end encoder-decoder CNN for 3D LiDAR semantic segmentation.
We present a novel multi-branch attentive feature fusion module in the encoder and a unique adaptive feature selection module with feature map re-weighting in the decoder.
Our experimental results show that the proposed method outperforms the state-of-the-art approaches on the large-scale SemanticKITTI benchmark.
arXiv Detail & Related papers (2021-02-08T21:04:21Z) - Multi-View Adaptive Fusion Network for 3D Object Detection [14.506796247331584]
3D object detection based on LiDAR-camera fusion is becoming an emerging research theme for autonomous driving.
We propose a single-stage multi-view fusion framework that takes LiDAR bird's-eye view, LiDAR range view and camera view images as inputs for 3D object detection.
We design an end-to-end learnable network, named MVAF-Net, to integrate these components.
arXiv Detail & Related papers (2020-11-02T00:06:01Z) - ASAP-Net: Attention and Structure Aware Point Cloud Sequence
Segmentation [49.15948235059343]
We further improve spatio-temporal point cloud feature learning with a flexible module called ASAP.
Our ASAP module contains an attentive temporal embedding layer to fuse the relatively informative local features across frames in a recurrent fashion.
We show the generalization ability of the proposed ASAP module with different computation backbone networks for point cloud sequence segmentation.
arXiv Detail & Related papers (2020-08-12T07:37:16Z)
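As referenced in the STF entry above, several of the related works (e.g., STF and ASAP) share the idea of aggregating features from neighboring frames into the current frame. The sketch below illustrates that generic idea with a simple attention-weighted sum; the module name, the scoring head, and the residual fusion are illustrative assumptions and do not reproduce any specific paper's design.

```python
# Illustrative sketch (not taken from any listed paper's code) of attention-based
# temporal feature fusion: neighboring-frame features are weighted against the
# current frame and summed, so complementary frames contribute more.
from typing import List

import torch
import torch.nn as nn


class TemporalAttentionFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(2 * channels, 1, kernel_size=1)  # assumed per-pixel similarity head

    def forward(self, cur_feat: torch.Tensor, neighbor_feats: List[torch.Tensor]) -> torch.Tensor:
        # Score each neighbor frame against the current frame ...
        scores = [self.score(torch.cat([cur_feat, n], dim=1)) for n in neighbor_feats]  # each (B,1,H,W)
        weights = torch.softmax(torch.stack(scores, dim=0), dim=0)                      # (T,B,1,H,W)
        # ... then take the softmax-weighted sum over the temporal axis.
        fused_neighbors = (weights * torch.stack(neighbor_feats, dim=0)).sum(dim=0)
        return cur_feat + fused_neighbors  # residual fusion keeps the current-frame signal


if __name__ == "__main__":
    fuse = TemporalAttentionFusion(channels=32)
    cur = torch.rand(2, 32, 40, 40)
    neighbors = [torch.rand(2, 32, 40, 40) for _ in range(3)]
    print(fuse(cur, neighbors).shape)  # torch.Size([2, 32, 40, 40])
```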
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.