Iterative Knowledge Exchange Between Deep Learning and Space-Time
Spectral Clustering for Unsupervised Segmentation in Videos
- URL: http://arxiv.org/abs/2012.07123v1
- Date: Sun, 13 Dec 2020 18:36:18 GMT
- Title: Iterative Knowledge Exchange Between Deep Learning and Space-Time
Spectral Clustering for Unsupervised Segmentation in Videos
- Authors: Emanuela Haller, Adina Magda Florea and Marius Leordeanu
- Abstract summary: We propose a dual system for unsupervised object segmentation in video.
The first module is a space-time graph that discovers objects in videos.
The second module is a deep network that learns powerful object features.
- Score: 17.47403549514259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a dual system for unsupervised object segmentation in video, which
brings together two modules with complementary properties: a space-time graph
that discovers objects in videos and a deep network that learns powerful object
features. The system uses an iterative knowledge exchange policy. A novel
spectral space-time clustering process on the graph produces unsupervised
segmentation masks passed to the network as pseudo-labels. The net learns to
segment in single frames what the graph discovers in video and passes back to
the graph strong image-level features that improve its node-level features in
the next iteration. Knowledge is exchanged for several cycles until
convergence. The graph has one node per video pixel, yet object discovery is
fast: a novel power iteration algorithm computes the main space-time cluster as
the principal eigenvector of a special Feature-Motion matrix without ever
explicitly forming the matrix. The thorough experimental
analysis validates our theoretical claims and proves the effectiveness of the
cyclical knowledge exchange. We also perform experiments in the supervised
scenario, incorporating features pretrained with human supervision. We achieve
state-of-the-art results in both the unsupervised and supervised scenarios on
four challenging datasets: DAVIS, SegTrack, YouTube-Objects, and DAVSOD.
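The matrix-free power iteration described in the abstract can be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation: the neighbor structure, the feature-similarity weighting, and all function names (`matvec`, `power_iteration`) are assumptions standing in for the paper's Feature-Motion formulation.

```python
import numpy as np

def matvec(x, features, neighbors):
    """Apply the implicit Feature-Motion matrix to x without storing it.

    For each node i, aggregate x over its space-time neighbors, weighted by
    feature similarity -- the n x n matrix itself is never materialized.
    """
    y = np.zeros_like(x)
    for i, nbrs in enumerate(neighbors):
        sims = features[nbrs] @ features[i]   # pairwise feature similarities
        y[i] = sims @ x[nbrs]                 # similarity-weighted aggregation
    return y

def power_iteration(features, neighbors, n_iter=50):
    """Approximate the principal eigenvector, read as a soft object mask."""
    n = len(features)
    x = np.full(n, 1.0 / np.sqrt(n))          # uniform initialization
    for _ in range(n_iter):
        x = matvec(x, features, neighbors)
        x = np.maximum(x, 0.0)                # keep the mask non-negative
        x /= np.linalg.norm(x) + 1e-12        # renormalize every step
    return x
```

Because each step touches only local space-time neighbors, the cost per iteration is linear in the number of graph edges rather than quadratic in the number of pixels, which is what makes discovery fast despite the one-node-per-pixel graph.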
Related papers
- SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
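The even distribution of features across a limited number of clusters mentioned above is typically achieved with Sinkhorn-Knopp normalization. A minimal sketch of that balancing step, with illustrative names and no claim to match SIGMA's actual implementation:

```python
import numpy as np

def sinkhorn(scores, n_iter=50):
    """Turn raw feature-to-cluster scores into a balanced soft assignment.

    Alternating row/column normalization pushes the assignment matrix toward
    one where every cluster receives an equal share of the features.
    """
    Q = np.exp(scores)                     # (n_features, n_clusters), positive
    for _ in range(n_iter):
        Q /= Q.sum(axis=0, keepdims=True)  # equalize mass per cluster
        Q /= Q.sum(axis=1, keepdims=True)  # each feature's weights sum to 1
    return Q
```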
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the structural dependencies inherent in DINO-pretrained Transformers can be leveraged to establish robust temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z) - Self-supervised Object-Centric Learning for Videos [39.02148880719576]
We propose the first fully unsupervised method for segmenting multiple objects in real-world sequences.
Our object-centric learning framework spatially binds objects to slots on each frame and then relates these slots across frames.
Our method can successfully segment multiple instances of complex and high-variety classes in YouTube videos.
arXiv Detail & Related papers (2023-10-10T18:03:41Z) - Learning a Fast 3D Spectral Approach to Object Segmentation and Tracking
over Space and Time [21.130594354306815]
We pose video object segmentation as spectral graph clustering in space and time.
We introduce a novel and efficient method based on 3D filtering for approximating the spectral solution.
We extend the formulation of our approach beyond the segmentation task, into the realm of object tracking.
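The filtering-based spectral approximation summarized above can be sketched as power iteration in which the explicit matrix-vector product is replaced by local 3D smoothing over the space-time volume. This is a loose illustration under stated assumptions: the simple box filter, the unary reweighting, and all names are stand-ins, not the paper's actual method.

```python
import numpy as np

def box_filter3d(x):
    """Average each voxel with its 6 space-time neighbors (toy 3D filter)."""
    y = x.copy()
    for axis in range(3):
        y = y + np.roll(x, 1, axis=axis) + np.roll(x, -1, axis=axis)
    return y / 7.0

def spectral_mask(unary, n_iter=20):
    """Approximate the leading eigenvector of a space-time pixel graph.

    unary: (T, H, W) per-pixel objectness scores in [0, 1].
    Each iteration propagates evidence via filtering instead of an explicit
    (and huge) adjacency-matrix multiplication.
    """
    x = unary.copy()
    for _ in range(n_iter):
        x = unary * box_filter3d(x)        # propagate, reweight by unaries
        x = np.maximum(x, 0.0)
        x /= np.linalg.norm(x) + 1e-12
    return x
```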
arXiv Detail & Related papers (2022-12-15T18:59:07Z) - Multi-Granularity Graph Pooling for Video-based Person Re-Identification [14.943835935921296]
Graph neural networks (GNNs) are introduced to aggregate temporal and spatial features of video samples.
Existing graph-based models, like STGCN, perform mean/max pooling on node features to obtain the graph representation.
We propose the graph pooling network (GPNet) to learn multi-granularity graph representations for video retrieval.
arXiv Detail & Related papers (2022-09-23T13:26:05Z) - End-to-end video instance segmentation via spatial-temporal graph neural
networks [30.748756362692184]
Video instance segmentation is a challenging task that extends image instance segmentation to the video domain.
Existing methods either rely only on single-frame information for the detection and segmentation subproblems or handle tracking as a separate post-processing step.
We propose a novel graph-neural-network (GNN) based method to handle the aforementioned limitation.
arXiv Detail & Related papers (2022-03-07T05:38:08Z) - Learning Multi-Granular Hypergraphs for Video-Based Person
Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision.
We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to attain better representational capabilities.
MGH achieves 90.0% top-1 accuracy on MARS, outperforming state-of-the-art schemes.
arXiv Detail & Related papers (2021-04-30T11:20:02Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-point estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - A Self-supervised Learning System for Object Detection in Videos Using
Random Walks on Graphs [20.369646864364547]
This paper presents a new self-supervised system for learning to detect novel and previously unseen categories of objects in images.
The proposed system receives as input several unlabeled videos of scenes containing various objects.
The frames of the videos are segmented into objects using depth information, and the segments are tracked along each video.
arXiv Detail & Related papers (2020-11-10T23:37:40Z) - Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z) - Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.