Motion-inductive Self-supervised Object Discovery in Videos
- URL: http://arxiv.org/abs/2210.00221v1
- Date: Sat, 1 Oct 2022 08:38:28 GMT
- Title: Motion-inductive Self-supervised Object Discovery in Videos
- Authors: Shuangrui Ding, Weidi Xie, Yabo Chen, Rui Qian, Xiaopeng Zhang,
Hongkai Xiong, Qi Tian
- Abstract summary: We propose a model that processes consecutive RGB frames and infers the optical flow between any pair of frames using a layered representation.
We demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets.
- Score: 99.35664705038728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we consider the task of unsupervised object discovery in
videos. Previous works have shown promising results via processing optical
flows to segment objects. However, taking flow as input brings about two
drawbacks. First, flow cannot capture sufficient cues when objects remain
static or partially occluded. Second, it is challenging to establish temporal
coherency from flow-only input, due to the missing texture information. To
tackle these limitations, we propose a model for directly processing
consecutive RGB frames, and infer the optical flow between any pair of frames
using a layered representation, with the opacity channels being treated as the
segmentation. Additionally, to enforce object permanence, we apply a temporal
consistency loss on the masks inferred from randomly paired frames, which
correspond to motions at different paces, encouraging the model to segment
objects even when they do not move at the current time point. Experimentally, we
demonstrate superior performance over previous state-of-the-art methods on
three public video segmentation datasets (DAVIS2016, SegTrackv2, and FBMS-59),
while being computationally efficient by avoiding the overhead of computing
optical flow as input.
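The temporal consistency idea above can be illustrated with a minimal sketch (this is an assumption-laden illustration, not the authors' implementation: the function names, the L1 penalty, and the frame-sampling scheme are all hypothetical):

```python
import numpy as np

def temporal_consistency_loss(mask_i: np.ndarray, mask_j: np.ndarray) -> float:
    """L1 disagreement between masks inferred from two randomly paired frames.

    mask_i, mask_j: opacity (alpha) channels in [0, 1] of shape (H, W),
    treated as soft segmentation masks. Identical masks give zero loss,
    so the model is pushed to keep segmenting an object even in pairs
    where that object happens not to move.
    """
    return float(np.mean(np.abs(mask_i - mask_j)))

# Sample frame pairs at random temporal gaps so the loss sees motion at
# different paces: a short gap may show the object static, a long gap moving.
rng = np.random.default_rng(0)
num_frames = 30
i = int(rng.integers(0, num_frames - 1))
gap = int(rng.integers(1, num_frames - i))
j = i + gap  # (i, j) is one randomly paced frame pair
```

In practice such a loss would be computed on the network's predicted masks inside the training loop; the sketch only shows the pairing and the penalty.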
Related papers
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation [28.19471998380114]
Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention.
Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks.
We propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification.
arXiv Detail & Related papers (2023-11-30T06:44:44Z)
- Tsanet: Temporal and Scale Alignment for Unsupervised Video Object Segmentation [21.19216164433897]
Unsupervised Video Object Segmentation (UVOS) refers to the challenging task of segmenting the prominent object in videos without manual guidance.
We propose a novel framework for UVOS that addresses the limitations of the two prevailing approaches.
We present experimental results on public benchmark datasets, DAVIS 2016 and FBMS, which demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-03-08T04:59:43Z)
- Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network, which extracts useful temporal information of the current frame from adjacent frames as the temporal branch of VSOD.
In the encoding stage, we generate high-level temporal features by using high-level features from the current and its adjacent frames.
In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches.
The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
arXiv Detail & Related papers (2022-08-01T15:56:19Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- FlowVOS: Weakly-Supervised Visual Warping for Detail-Preserving and Temporally Consistent Single-Shot Video Object Segmentation [4.3171602814387136]
We introduce a new foreground-targeted visual warping approach that learns flow fields from VOS data.
We train a flow module to capture detailed motion between frames using two weakly-supervised losses.
Our approach produces segmentations with high detail and temporal consistency.
arXiv Detail & Related papers (2021-11-20T16:17:10Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.