Related papers: Self-supervised Video Object Segmentation by Motion Grouping

Self-supervised Video Object Segmentation by Motion Grouping

URL: http://arxiv.org/abs/2104.07658v1
Date: Thu, 15 Apr 2021 17:59:32 GMT
Title: Self-supervised Video Object Segmentation by Motion Grouping
Authors: Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, Weidi Xie
Abstract summary: We develop a computer vision system able to segment objects by exploiting motion cues. We introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background. We evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59)
Score: 79.13206959575228
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Animals have evolved highly functional visual systems to understand motion, assisting perception even under complex environments. In this paper, we work towards developing a computer vision system able to segment objects by exploiting motion cues, i.e. motion segmentation. We make the following contributions: First, we introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background. Second, we train the architecture in a self-supervised manner, i.e. without using any manual annotations. Third, we analyze several critical components of our method and conduct thorough ablation studies to validate their necessity. Fourth, we evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59). Despite using only optical flow as input, our approach achieves superior or comparable results to previous state-of-the-art self-supervised methods, while being an order of magnitude faster. We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming the other self-supervised approaches, and comparing favourably to the top supervised approach, highlighting the importance of motion cues, and the potential bias towards visual appearance in existing video segmentation models.

Related papers

Object Concepts Emerge from Motion [24.73461163778215]
We propose a biologically inspired framework for learning object-centric visual representations in an unsupervised manner.<n>Our key insight is that motion boundary serves as a strong signal for object-level grouping.<n>Our framework is fully label-free and does not rely on camera calibration, making it scalable to large-scale unstructured video data.
arXiv Detail & Related papers (2025-05-27T18:09:02Z)
MonoMobility: Zero-Shot 3D Mobility Analysis from Monocular Videos [43.906631899750906]
We propose an innovative framework that can analyze 3D mobility from monocular videos in a zero-shot manner.<n>This framework can precisely parse motion parts and motion attributes only using a monocular video, completely eliminating the need for annotated training data.<n>Building on this, we introduce an end-to-end dynamic scene optimization algorithm specifically designed for articulated objects.
arXiv Detail & Related papers (2025-05-17T06:21:05Z)
Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training [13.985488693082981]
We propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks. We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks.
arXiv Detail & Related papers (2023-08-22T07:27:09Z)
Dyna-DepthFormer: Multi-frame Transformer for Self-Supervised Depth Estimation in Dynamic Scenes [19.810725397641406]
We propose a novel Dyna-Depthformer framework, which predicts scene depth and 3D motion field jointly. Our contributions are two-fold. First, we leverage multi-view correlation through a series of self- and cross-attention layers in order to obtain enhanced depth feature representation. Second, we propose a warping-based Motion Network to estimate the motion field of dynamic objects without using semantic prior.
arXiv Detail & Related papers (2023-01-14T09:43:23Z)
Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation. We introduce a scalable pipeline for generating synthetic training data with multiple objects. We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation [76.58256020932312]
Estimating the motion of the camera together with the 3D structure of the scene from a monocular vision system is a complex task. We present a self-supervised learning framework for 3D object motion field estimation from monocular videos.
arXiv Detail & Related papers (2021-10-13T16:45:01Z)
Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field. It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations. Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR. Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking. Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.