RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory
- URL: http://arxiv.org/abs/2504.16471v1
- Date: Wed, 23 Apr 2025 07:31:37 GMT
- Title: RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory
- Authors: Boyue Xu, Ruichao Hou, Tongwei Ren, Gangshan Wu
- Abstract summary: RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of RGB with the geometric clues of the depth modality. In this paper, we propose a novel RGB-D VOS method via multi-store feature memory for robust segmentation. We show that the proposed method achieves state-of-the-art performance on the latest RGB-D VOS benchmark.
- Score: 34.406308400305385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of RGB with the spatial geometric clues of the depth modality, boosting segmentation performance. However, off-the-shelf RGB-D segmentation methods fail to fully explore cross-modal information and suffer from object drift during long-term prediction. In this paper, we propose a novel RGB-D VOS method via multi-store feature memory for robust segmentation. Specifically, we design a hierarchical modality selection and fusion scheme, which adaptively combines features from both modalities. Additionally, we develop a segmentation refinement module that effectively utilizes the Segment Anything Model (SAM) to refine the segmentation mask, ensuring more reliable results that serve as memory to guide subsequent segmentation. By leveraging spatio-temporal embedding and modality embedding, mixed prompts and fused images are fed into SAM to unleash its potential in RGB-D VOS. Experimental results show that the proposed method achieves state-of-the-art performance on the latest RGB-D VOS benchmark.
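The abstract gives no implementation details, but the "adaptively combines features from both modalities" step can be pictured as a learned gate that weighs RGB against depth responses at each level of the feature hierarchy. The sketch below is an illustrative assumption (the module name ModalityFusionGate, the 1x1-conv gating, and the channel sizes are all invented here), not the authors' released code.

```python
import torch
import torch.nn as nn

class ModalityFusionGate(nn.Module):
    """Hypothetical gate that adaptively blends RGB and depth feature maps."""

    def __init__(self, in_channels: int):
        super().__init__()
        # Predict a per-pixel, per-channel selection weight from both modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * in_channels, in_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # w near 1 keeps the RGB response, w near 0 prefers the depth response.
        w = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))
        return w * rgb_feat + (1.0 - w) * depth_feat

# Applying one such gate per pyramid level would give a hierarchical selection-and-fusion.
fuse = ModalityFusionGate(in_channels=256)
fused = fuse(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```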
Related papers
- HDBFormer: Efficient RGB-D Semantic Segmentation with A Heterogeneous Dual-Branch Framework [0.0]
In RGB-D semantic segmentation for indoor scenes, a key challenge is effectively integrating the rich color information from RGB images with the spatial distance information from depth images.
We propose a novel heterogeneous dual-branch framework called HDBFormer, specifically designed to handle these modality differences.
For RGB images, which contain rich detail, we employ both a basic and detail encoder to extract local and global features.
For the simpler depth images, we propose LDFormer, a lightweight hierarchical encoder that efficiently extracts depth features with fewer parameters.
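As a rough illustration of the asymmetric design described above (a heavier RGB branch, a lightweight depth branch), the following sketch uses plain convolutional stacks. It is an assumption made for exposition, not the published HDBFormer/LDFormer architecture, and all layer and channel choices are invented.

```python
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    """Hypothetical asymmetric encoder: heavy RGB branch, light depth branch."""

    def __init__(self):
        super().__init__()
        # RGB branch: deeper stack to capture rich local detail and context.
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Depth branch: fewer layers and channels, since depth maps are simpler.
        self.depth_branch = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Project depth features to the RGB channel width before fusion.
        self.depth_proj = nn.Conv2d(64, 256, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        r = self.rgb_branch(rgb)                       # roughly (B, 256, H/8, W/8)
        d = self.depth_proj(self.depth_branch(depth))  # matches the RGB resolution
        return r + d                                   # simple additive fusion

enc = DualBranchEncoder()
feat = enc(torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224))
```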
arXiv Detail & Related papers (2025-04-18T09:29:46Z) - IAM: Enhancing RGB-D Instance Segmentation with New Benchmarks [4.3266254914862445]
RGB-D segmentation promises richer scene understanding than RGB-only methods.
There is a relative scarcity of instance-level RGB-D segmentation datasets.
We introduce three RGB-D instance segmentation benchmarks, distinguished at the instance level.
We propose a simple yet effective method for RGB-D data integration.
arXiv Detail & Related papers (2025-01-03T08:03:24Z) - ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection (ViDSOD-100) dataset, which contains 100 videos with a total of 9,362 frames.
All frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net) for RGB-D salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z) - Optimizing rgb-d semantic segmentation through multi-modal interaction and pooling attention [5.518612382697244]
Multi-modal Interaction and Pooling Attention Network (MIPANet) is designed to harness the interactive synergy between RGB and depth modalities.
We introduce a Pooling Attention Module (PAM) at various stages of the encoder.
This module serves to amplify the features extracted by the network and integrates the module's output into the decoder.
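A pooling-attention block of the kind this summary describes (global pooling followed by channel re-weighting that amplifies features before they reach the decoder) might look roughly like the following. The class name PoolingAttention and the reduction ratio are assumptions, not the published MIPANet code.

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    """Hypothetical pooling-attention block: global context re-weights channels."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze spatial dims to a global context vector
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Amplify informative channels; keep a residual path so nothing is lost.
        return x + x * self.mlp(self.pool(x))

pam = PoolingAttention(channels=128)
amplified = pam(torch.randn(2, 128, 32, 32))  # output keeps the input shape
```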
arXiv Detail & Related papers (2023-11-19T12:25:59Z) - SAD: Segment Any RGBD [54.24917975958583]
The Segment Anything Model (SAM) has demonstrated its effectiveness in segmenting any part of 2D RGB images.
We propose the Segment Any RGBD (SAD) model, which is specifically designed to extract geometry information directly from images.
arXiv Detail & Related papers (2023-05-23T16:26:56Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modalities to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - FEANet: Feature-Enhanced Attention Network for RGB-Thermal Real-time Semantic Segmentation [19.265576529259647]
We propose a two-stage Feature-Enhanced Attention Network (FEANet) for the RGB-T semantic segmentation task.
Specifically, we introduce a Feature-Enhanced Attention Module (FEAM) to excavate and enhance multi-level features from both the channel and spatial views.
Benefiting from the proposed FEAM module, our FEANet can preserve spatial information and shift more attention to high-resolution features from the fused RGB-T images.
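A feature-enhancement block that attends over both the channel and spatial views, as FEAM is summarized to do, could be sketched as below. This is an illustrative stand-in (a CBAM-style attention pair), not the released FEANet implementation.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Hypothetical FEAM-like block: enhance features from channel and spatial views."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel(x)                                # channel view
        avg = x.mean(dim=1, keepdim=True)                      # spatial statistics
        mx, _ = x.max(dim=1, keepdim=True)
        return x * self.spatial(torch.cat([avg, mx], dim=1))   # spatial view

feam = ChannelSpatialAttention(channels=64)
enhanced = feam(torch.randn(1, 64, 60, 80))  # e.g. fused RGB-T features
```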
arXiv Detail & Related papers (2021-10-18T02:43:41Z) - Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection [78.47767202232298]
We propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD.
Two components are designed to implement the effective cross-modality interaction.
Our network outperforms 15 state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-04T11:24:42Z) - Siamese Network for RGB-D Salient Object Detection and Beyond [113.30063105890041]
A novel framework is proposed to learn from both RGB and depth inputs through a shared network backbone.
Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector.
We also link JL-DCF to the RGB-D semantic segmentation field, showing its capability of outperforming several semantic segmentation models.
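The shared-backbone idea summarized above can be illustrated in a few lines: one encoder, applied with identical weights to the RGB image and a three-channel copy of the depth map. The backbone choice (torchvision ResNet-18) is an arbitrary stand-in for illustration, not the JL-DCF backbone.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# One backbone, shared weights for both modalities.
backbone = nn.Sequential(*list(models.resnet18(weights=None).children())[:-2])

rgb = torch.randn(1, 3, 224, 224)
depth = torch.randn(1, 1, 224, 224).repeat(1, 3, 1, 1)  # tile depth to 3 channels

rgb_feat = backbone(rgb)      # (1, 512, 7, 7)
depth_feat = backbone(depth)  # same weights, so both modalities share one feature space
```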
arXiv Detail & Related papers (2020-08-26T06:01:05Z) - Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Crossmodality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternatively.
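A hedged sketch of bi-directional recalibration in the spirit of this summary: each modality gates the other before the two recalibrated maps are aggregated. The module name and the 1x1-conv gating form are assumptions, not the published SA-Gate design.

```python
import torch
import torch.nn as nn

class CrossModalRecalibration(nn.Module):
    """Hypothetical bi-directional gate: each modality recalibrates the other."""

    def __init__(self, channels: int):
        super().__init__()
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.aggregate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        rgb_recal = rgb * self.depth_gate(depth)    # depth recalibrates RGB responses
        depth_recal = depth * self.rgb_gate(rgb)    # RGB helps clean the depth responses
        return self.aggregate(torch.cat([rgb_recal, depth_recal], dim=1))

gate = CrossModalRecalibration(channels=256)
out = gate(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
```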
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.