SimulFlow: Simultaneously Extracting Feature and Identifying Target for
Unsupervised Video Object Segmentation
- URL: http://arxiv.org/abs/2311.18286v1
- Date: Thu, 30 Nov 2023 06:44:44 GMT
- Title: SimulFlow: Simultaneously Extracting Feature and Identifying Target for
Unsupervised Video Object Segmentation
- Authors: Lingyi Hong, Wei Zhang, Shuyong Gao, Hong Lu, WenQiang Zhang
- Abstract summary: Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention.
Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks.
We propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification.
- Score: 28.19471998380114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised video object segmentation (UVOS) aims at detecting the primary
objects in a given video sequence without any human intervention. Most existing
methods rely on two-stream architectures that separately encode the appearance
and motion information before fusing them to identify the target and generate
object masks. However, this pipeline is computationally expensive and can lead
to suboptimal performance due to the difficulty of fusing the two modalities
properly. In this paper, we propose a novel UVOS model called SimulFlow that
simultaneously performs feature extraction and target identification, enabling
efficient and effective unsupervised video object segmentation. Concretely, we
design a novel SimulFlow Attention mechanism to bridge image and motion by
exploiting the flexibility of the attention operation, where coarse masks predicted
from the fused features at each stage are used to constrain the attention
within the mask area and exclude the impact of noise. Because of the
bidirectional information flow between visual and optical flow features in
SimulFlow Attention, no extra hand-designed fusion module is required, and we
adopt only a lightweight decoder to obtain the final prediction. We evaluate our
method on several benchmark datasets and achieve state-of-the-art results. Our
proposed approach not only outperforms existing methods but also addresses the
computational complexity and fusion difficulties caused by two-stream
architectures. Our model achieves 87.4% J&F on DAVIS-16 with the highest
speed (63.7 FPS on a 3090 GPU) and the fewest parameters (13.7 M). Our SimulFlow
also obtains competitive results on video salient object detection datasets.
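To make the mask-constrained attention idea described in the abstract more concrete, below is a minimal, hypothetical PyTorch-style sketch. It is not the authors' released SimulFlow code: the module name, tensor shapes, and the use of a soft log-bias to gate the attention weights with the coarse mask are all assumptions.

```python
# Sketch of cross-attention from appearance tokens to optical-flow tokens,
# constrained by a coarse foreground mask. Illustration only; names, shapes,
# and the exact masking rule are assumptions, not the SimulFlow implementation.
import torch
import torch.nn as nn


class MaskConstrainedAttention(nn.Module):
    """Cross-attention gated by a coarse mask predicted at the previous stage."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_tokens, flow_tokens, coarse_mask):
        # img_tokens, flow_tokens: (B, N, C); coarse_mask: (B, N) foreground
        # probabilities predicted from the fused features of the previous stage.
        B, N, C = img_tokens.shape
        h = self.num_heads

        q = self.to_q(img_tokens).view(B, N, h, C // h).transpose(1, 2)
        k = self.to_k(flow_tokens).view(B, N, h, C // h).transpose(1, 2)
        v = self.to_v(flow_tokens).view(B, N, h, C // h).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale        # (B, h, N, N)

        # Constrain attention to the coarse mask area: keys outside the mask
        # receive a large negative bias, suppressing background noise.
        mask_bias = torch.log(coarse_mask.clamp_min(1e-6))   # (B, N)
        attn = attn + mask_bias[:, None, None, :]
        attn = attn.softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Dummy usage: a 14x14 token grid with 64-dim features.
if __name__ == "__main__":
    block = MaskConstrainedAttention(dim=64)
    fused = block(torch.randn(2, 196, 64), torch.randn(2, 196, 64),
                  torch.rand(2, 196))
    print(fused.shape)  # torch.Size([2, 196, 64])
```

Gating the attention with log-probabilities of the coarse mask keeps the softmax numerically stable even when a frame contains almost no predicted foreground. In SimulFlow the information flow is described as bidirectional, so an analogous block would also update the flow tokens from the image tokens.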
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Efficient Unsupervised Video Object Segmentation Network Based on Motion Guidance [1.5736899098702974]
This paper proposes a video object segmentation network based on motion guidance.
The model comprises a dual-stream network, a motion guidance module, and a multi-scale progressive fusion module.
Experimental results demonstrate the superior performance of the proposed method.
arXiv Detail & Related papers (2022-11-10T06:13:23Z)
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z)
- Motion-inductive Self-supervised Object Discovery in Videos [99.35664705038728]
We propose a model for processing consecutive RGB frames, and infer the optical flow between any pair of frames using a layered representation.
We demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets.
arXiv Detail & Related papers (2022-10-01T08:38:28Z)
- Hierarchical Feature Alignment Network for Unsupervised Video Object Segmentation [99.70336991366403]
We propose a concise, practical, and efficient architecture for appearance and motion feature alignment.
The proposed HFAN reaches a new state-of-the-art performance on DAVIS-16, achieving 88.7 $\mathcal{J}\&\mathcal{F}$ Mean, i.e., a relative improvement of 3.5% over the best published result.
arXiv Detail & Related papers (2022-07-18T10:10:14Z)
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation for many applications, such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
- Efficient Two-Stream Network for Violence Detection Using Separable Convolutional LSTM [0.0]
We propose an efficient two-stream deep learning architecture leveraging Separable Convolutional LSTM (SepConvLSTM) and pre-trained MobileNet.
SepConvLSTM is constructed by replacing the convolution operation at each gate of ConvLSTM with a depthwise separable convolution (see the sketch after this list).
Our model outperforms the previous state-of-the-art accuracy on the larger and more challenging RWF-2000 dataset by more than a 2% margin.
arXiv Detail & Related papers (2021-02-21T12:01:48Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
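As a side note on the SepConvLSTM entry above: a minimal sketch of a depthwise separable convolution, the building block said to replace the full convolution at each ConvLSTM gate, might look as follows. This is an illustration under assumed hyperparameters, not the authors' implementation.

```python
# Sketch of a depthwise separable convolution of the kind used in SepConvLSTM
# gates: per-channel spatial filtering followed by a 1x1 channel-mixing conv.
# Illustration only; layer names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class DepthwiseSeparableConv2d(nn.Module):
    """Depthwise conv (one filter per channel) followed by a pointwise 1x1 conv."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # Depthwise: groups=in_channels gives one spatial filter per input channel.
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=kernel_size // 2, groups=in_channels)
        # Pointwise: 1x1 convolution that mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


# In a SepConvLSTM-style cell, each of the four gates (input, forget, output,
# candidate) would apply a layer like this to the concatenated [input, hidden]
# tensor instead of a standard convolution, reducing parameters and computation.
if __name__ == "__main__":
    gate_conv = DepthwiseSeparableConv2d(in_channels=96, out_channels=64)
    print(gate_conv(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```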