Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation
- URL: http://arxiv.org/abs/2507.22465v1
- Date: Wed, 30 Jul 2025 08:11:18 GMT
- Title: Shallow Features Matter: Hierarchical Memory with Heterogeneous Interaction for Unsupervised Video Object Segmentation
- Authors: Zheng Xiangyu, He Songcheng, Li Wanyun, Li Xiaoqiang, Zhang Wei
- Abstract summary: Unsupervised Video Object Segmentation (UVOS) aims to predict pixel-level masks for the most salient objects in videos without any prior annotations. Our analysis reveals a simple but fundamental flaw in existing methods: over-reliance on memorizing high-level semantic features. We propose a novel hierarchical memory architecture that incorporates both shallow- and high-level features.
- Score: 1.5223740593989445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unsupervised Video Object Segmentation (UVOS) aims to predict pixel-level masks for the most salient objects in videos without any prior annotations. While memory mechanisms have been proven critical in various video segmentation paradigms, their application in UVOS yields only marginal performance gains despite sophisticated design. Our analysis reveals a simple but fundamental flaw in existing methods: over-reliance on memorizing high-level semantic features. UVOS inherently lacks fine-grained information due to the absence of pixel-level prior knowledge. Consequently, memory design relying solely on high-level features, which predominantly capture abstract semantic cues, is insufficient to generate precise predictions. To resolve this fundamental issue, we propose a novel hierarchical memory architecture that incorporates both shallow- and high-level features for memory, leveraging the complementary benefits of pixel and semantic information. Furthermore, to balance the simultaneous utilization of the pixel and semantic memory features, we propose a heterogeneous interaction mechanism to perform pixel-semantic mutual interactions, which explicitly considers their inherent feature discrepancies. Through the design of the Pixel-guided Local Alignment Module (PLAM) and the Semantic-guided Global Integration Module (SGIM), we achieve delicate integration of the fine-grained details in shallow-level memory and the semantic representations in high-level memory. Our Hierarchical Memory with Heterogeneous Interaction Network (HMHI-Net) consistently achieves state-of-the-art performance across all UVOS and video saliency detection benchmarks. Moreover, HMHI-Net consistently exhibits high performance across different backbones, further demonstrating its superiority and robustness. Project page: https://github.com/ZhengxyFlow/HMHI-Net .
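The core idea of the abstract, reading a shallow (pixel-level) memory and a deep (semantic) memory separately and then fusing the two cues, can be illustrated with a minimal NumPy sketch. All dimensions are hypothetical, the attention readout is generic, and the single linear projection used for fusion stands in for the paper's much richer PLAM/SGIM modules.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def memory_read(query, keys, values):
    # Attention-style readout: each query location aggregates memory values
    # weighted by scaled dot-product similarity with the memory keys.
    attn = softmax(query @ keys.T / np.sqrt(keys.shape[1]))
    return attn @ values

# Hypothetical dimensions: 16 query locations, 32 memory slots,
# shallow (pixel-level) features of dim 8, deep (semantic) features of dim 64.
rng = np.random.default_rng(0)
q_shallow = rng.normal(size=(16, 8))
q_deep = rng.normal(size=(16, 64))
m_shallow = rng.normal(size=(32, 8))
m_deep = rng.normal(size=(32, 64))

# Read each memory level with query features of the matching granularity.
pixel_cue = memory_read(q_shallow, m_shallow, m_shallow)
semantic_cue = memory_read(q_deep, m_deep, m_deep)

# A naive stand-in for the heterogeneous interaction: project the pixel cue
# into the semantic dimension and add (the actual PLAM/SGIM are learned
# modules that align and integrate the two levels far more carefully).
proj = rng.normal(size=(8, 64)) / np.sqrt(8)
fused = semantic_cue + pixel_cue @ proj
print(fused.shape)  # (16, 64)
```

The point of the sketch is only the structure: two memories at different feature granularities, each read with a matching query, followed by an explicit cross-level fusion step.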
Related papers
- Topology-aware Embedding Memory for Continual Learning on Expanding Networks [63.35819388164267]
We present a framework to tackle the memory explosion problem using memory replay techniques.
PDGNNs with Topology-aware Embedding Memory (TEM) significantly outperform state-of-the-art techniques.
arXiv Detail & Related papers (2024-01-24T03:03:17Z) - Memory-Constrained Semantic Segmentation for Ultra-High Resolution UAV Imagery [35.96063342025938]
This paper explores the intricate problem of achieving efficient and effective segmentation of ultra-high resolution UAV imagery.
We propose a GPU memory-efficient and effective framework for local inference without accessing the context beyond local patches.
We present an efficient memory-based interaction scheme to correct potential semantic bias of the underlying high-resolution information.
arXiv Detail & Related papers (2023-10-07T07:44:59Z) - Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation [47.7036344302777]
Current Video Object Segmentation methods follow an extraction-then-matching pipeline. We propose a unified VOS framework, coined JointFormer, for joint modeling of features, correspondence, and a compressed memory.
arXiv Detail & Related papers (2023-08-25T17:30:08Z) - Holistic Prototype Attention Network for Few-Shot VOS [74.25124421163542]
Few-shot video object segmentation (FSVOS) aims to segment dynamic objects of unseen classes by resorting to a small set of support images.
We propose a holistic prototype attention network (HPAN) for advancing FSVOS.
arXiv Detail & Related papers (2023-07-16T03:48:57Z) - Unsupervised Video Object Segmentation via Prototype Memory Network [5.612292166628669]
Unsupervised video object segmentation aims to segment a target object in the video without a ground truth mask in the initial frame.
This challenge requires extracting features for the most salient common objects within a video sequence.
We propose a novel prototype memory network architecture to solve this problem.
arXiv Detail & Related papers (2022-09-08T11:08:58Z) - In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z) - MUNet: Motion Uncertainty-aware Semi-supervised Video Object Segmentation [31.100954335785026]
We advocate the return of motion information and propose a motion uncertainty-aware framework (MUNet) for semi-supervised video object segmentation.
We introduce a motion-aware spatial attention module to effectively fuse the motion feature with the semantic feature.
We achieve 76.5% $\mathcal{J}\&\mathcal{F}$ using only DAVIS17 for training, which significantly outperforms the SOTA methods under the low-data protocol.
arXiv Detail & Related papers (2021-11-29T16:01:28Z) - Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
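The readout described above, one frame-to-frame affinity that lets every memory node vote and is shared across objects, can be sketched in NumPy. The sizes and random features below are purely illustrative; the actual method learns the key features end to end.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: 20 query pixels, 30 memory pixels, feature dim 16,
# and 3 objects whose memory masks all reuse one affinity matrix.
rng = np.random.default_rng(1)
query_feat = rng.normal(size=(20, 16))
memory_feat = rng.normal(size=(30, 16))
memory_masks = rng.random(size=(30, 3))   # per-object soft masks

# One frame-to-frame affinity, computed independently of any object,
# so mask features need not be re-encoded for each object.
affinity = softmax(query_feat @ memory_feat.T / np.sqrt(16))

# Softmax weights are strictly positive, so every memory node casts a
# nonzero vote for every query pixel (the "diversified voting" property).
assert (affinity > 0).all()
propagated = affinity @ memory_masks
print(propagated.shape)  # (20, 3)
```

Because the affinity is object-agnostic, adding more objects only adds columns to `memory_masks`; the expensive similarity computation is done once per frame pair.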
arXiv Detail & Related papers (2021-06-09T16:50:57Z) - Video Object Segmentation with Episodic Graph Memory Networks [198.74780033475724]
A graph memory network is developed to address the novel idea of "learning to update the segmentation model".
We exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges.
The proposed graph memory network yields a neat yet principled framework, which can generalize well both one-shot and zero-shot video object segmentation tasks.
arXiv Detail & Related papers (2020-07-14T13:19:19Z)
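The episodic graph memory in the last entry, frames as nodes of a fully connected graph with edges capturing cross-frame correlation, can be sketched as one round of similarity-weighted message passing. This is a drastic simplification with made-up sizes; the paper's update rules are learned, not a fixed averaging step.

```python
import numpy as np

# Toy episodic graph memory: 5 memorized frames as nodes with dim-32 features.
rng = np.random.default_rng(2)
nodes = rng.normal(size=(5, 32))

# Fully connected graph: edge weights from pairwise feature similarity,
# with self-edges removed before row-wise normalization.
sim = nodes @ nodes.T / np.sqrt(32)
np.fill_diagonal(sim, -np.inf)
weights = np.exp(sim - sim.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# One message-passing step: each node mixes in its neighbors' features,
# letting cross-frame information flow through the memory graph.
updated = 0.5 * nodes + 0.5 * (weights @ nodes)
print(updated.shape)  # (5, 32)
```

Each edge weight here plays the role of a learned cross-frame correlation; stacking such steps (with learned transforms) is what lets a graph memory propagate segmentation evidence across the stored episode.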
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.