Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos
- URL: http://arxiv.org/abs/2308.09951v2
- Date: Thu, 21 Mar 2024 09:28:34 GMT
- Title: Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos
- Authors: Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin,
- Abstract summary: Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
- Score: 63.94040814459116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random sampling based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages with a set of shared learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate semantic segmentation masks through iterative attention. In the second stage, for each semantics, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations. Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery. Furthermore, we achieve state-of-the-art performance on dense label propagation tasks, demonstrating the potential for object-centric analysis. The code is released at https://github.com/shvdiwnkozbw/SMTC.
Related papers
- Mutually-Aware Feature Learning for Few-Shot Object Counting [20.623402944601775]
Few-shot object counting has garnered significant attention for its practicality as it aims to count target objects in a query image based on given exemplars without the need for additional training.
We propose a novel framework, Mutually-Aware FEAture learning(MAFEA), which encodes query and exemplar features mutually aware of each other from the outset.
Our model reaches a new state-of-the-art performance on the two challenging benchmarks, FSCD-LVIS and FSC-147, with a remarkably reduced degree of the target confusion problem.
arXiv Detail & Related papers (2024-08-19T06:46:24Z) - Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised
Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps.
We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z) - FM-Fusion: Instance-aware Semantic Mapping Boosted by Vision-Language Foundation Models [24.77953131753715]
Development of vision-language foundation models demonstrates a strong zero-shot transferability across data distribution.
We propose a probabilistic label fusion method to predict close-set semantic classes from open-set label measurements.
We integrate all the modules into a unified semantic mapping system. Reading a sequence of RGB-D input, our work incrementally reconstructs an instance-aware semantic map.
arXiv Detail & Related papers (2024-02-07T03:19:02Z) - Object-Centric Multiple Object Tracking [124.30650395969126]
This paper proposes a video object-centric model for multiple-object tracking pipelines.
It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module.
Benefited from object-centric learning, we only require sparse detection labels for object localization and feature binding.
arXiv Detail & Related papers (2023-09-01T03:34:12Z) - LEAD: Self-Supervised Landmark Estimation by Aligning Distributions of
Feature Similarity [49.84167231111667]
Existing works in self-supervised landmark detection are based on learning dense (pixel-level) feature representations from an image.
We introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion.
We show that having such a prior in the feature extractor helps in landmark detection, even under drastically limited number of annotations.
arXiv Detail & Related papers (2022-04-06T17:48:18Z) - Detecting Human-Object Interactions with Object-Guided Cross-Modal
Calibrated Semantics [6.678312249123534]
We aim to boost end-to-end models with object-guided statistical priors.
We propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy.
The above modules combined composes Object-guided Cross-modal Network (OCN)
arXiv Detail & Related papers (2022-02-01T07:39:04Z) - Target-Aware Object Discovery and Association for Unsupervised Video
Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient unseen-temporal segmentation.
We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z) - Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets new state-of-the-arts on all these settings, demonstrating well its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.