Segment Anything, Even Occluded
- URL: http://arxiv.org/abs/2503.06261v1
- Date: Sat, 08 Mar 2025 16:14:57 GMT
- Title: Segment Anything, Even Occluded
- Authors: Wei-En Tai, Yu-Lin Shih, Cheng Sun, Yu-Chiang Frank Wang, Hwann-Tzong Chen
- Abstract summary: SAMEO is a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder. We introduce Amodal-LVIS, a large-scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets. Our results demonstrate that our approach, when trained on the newly extended dataset, achieves remarkable zero-shot performance on both COCOA-cls and D2SA benchmarks.
- Score: 35.150696061791805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Amodal instance segmentation, which aims to detect and segment both visible and invisible parts of objects in images, plays a crucial role in various applications including autonomous driving, robotic manipulation, and scene understanding. While existing methods require training both front-end detectors and mask decoders jointly, this approach lacks flexibility and fails to leverage the strengths of pre-existing modal detectors. To address this limitation, we propose SAMEO, a novel framework that adapts the Segment Anything Model (SAM) as a versatile mask decoder capable of interfacing with various front-end detectors to enable mask prediction even for partially occluded objects. Acknowledging the constraints of limited amodal segmentation datasets, we introduce Amodal-LVIS, a large-scale synthetic dataset comprising 300K images derived from the modal LVIS and LVVIS datasets. This dataset significantly expands the training data available for amodal segmentation research. Our experimental results demonstrate that our approach, when trained on the newly extended dataset, including Amodal-LVIS, achieves remarkable zero-shot performance on both COCOA-cls and D2SA benchmarks, highlighting its potential for generalization to unseen scenarios.
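As a rough illustration of the decoupled design described in the abstract (not the authors' released code), the sketch below shows how boxes from any off-the-shelf modal detector could be fed as prompts to a SAM-style mask decoder. The `segment_anything` calls are the public API; the `sameo_vit_h.pth` checkpoint name and the `detect_boxes` helper are placeholders.

```python
# Minimal sketch of box-prompted amodal mask prediction, assuming the public
# `segment_anything` package. The checkpoint path "sameo_vit_h.pth" and the
# `detect_boxes` front-end are hypothetical stand-ins, not SAMEO's actual code.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def detect_boxes(image: np.ndarray) -> np.ndarray:
    """Placeholder for any front-end modal detector returning (N, 4) XYXY boxes."""
    raise NotImplementedError("plug in your detector here")

def amodal_masks(image: np.ndarray, checkpoint: str = "sameo_vit_h.pth"):
    # Hypothetical amodal-finetuned SAM weights; vanilla SAM weights also load here.
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # HxWx3 RGB uint8 array

    results = []
    for box in detect_boxes(image):
        # Each detector box becomes a prompt; an amodal-trained decoder would
        # predict a full-object mask that may extend beyond the visible region.
        masks, scores, _ = predictor.predict(box=box, multimask_output=False)
        results.append((masks[0], float(scores[0])))
    return results
```

With vanilla SAM weights this sketch returns ordinary modal masks; the point of SAMEO's design is that the same box-prompt interface, once the decoder is trained on amodal data, yields masks covering both visible and occluded regions without retraining the detector.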
Related papers
- Unveiling the Invisible: Reasoning Complex Occlusions Amodally with AURA [49.10341970643037]
Amodal segmentation aims to infer the complete shape of occluded objects, even when the occluded region's appearance is unavailable.
Current amodal segmentation methods lack the capability to interact with users through text input.
We propose a novel task named amodal reasoning segmentation, aiming to predict the complete amodal shape of occluded objects.
arXiv Detail & Related papers (2025-03-13T10:08:18Z) - SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection [73.49799596304418]
This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing. It is designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization.
arXiv Detail & Related papers (2024-12-30T02:47:51Z) - A2VIS: Amodal-Aware Approach to Video Instance Segmentation [8.082593574401704]
We propose Amodal-Aware Video Instance Segmentation (A2VIS), a novel framework that incorporates amodal representations to achieve a reliable and comprehensive understanding of both visible and occluded parts of objects in video.
arXiv Detail & Related papers (2024-12-02T05:44:29Z) - Adapting Segment Anything Model to Multi-modal Salient Object Detection with Semantic Feature Fusion Guidance [15.435695491233982]
We propose a novel framework to explore and exploit the powerful feature representation and zero-shot generalization ability of the Segment Anything Model (SAM) for multi-modal salient object detection (SOD).
We develop SAM with semantic feature fusion guidance (Sammese).
In the image encoder, a multi-modal adapter is proposed to adapt the single-modal SAM to multi-modal information; in the mask decoder, a semantic-geometric prompt generation strategy is introduced.
arXiv Detail & Related papers (2024-08-27T13:47:31Z) - Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - FM-ViT: Flexible Modal Vision Transformers for Face Anti-Spoofing [88.6654909354382]
We present a pure transformer-based framework, dubbed the Flexible Modal Vision Transformer (FM-ViT) for face anti-spoofing.
FM-ViT can flexibly target any single-modal (i.e., RGB) attack scenarios with the help of available multi-modal data.
Experiments demonstrate that the single model trained based on FM-ViT can not only flexibly evaluate different modal samples, but also outperforms existing single-modal frameworks by a large margin.
arXiv Detail & Related papers (2023-05-05T04:28:48Z) - Amodal Panoptic Segmentation [13.23676270963484]
We formulate and propose a novel task that we name amodal panoptic segmentation.
The goal of this task is to simultaneously predict the pixel-wise semantic segmentation labels of the visible regions of stuff classes and the instance segmentation labels of both the visible and occluded regions of thing classes.
We propose the novel amodal panoptic segmentation network (APSNet) as a first step towards addressing this task.
arXiv Detail & Related papers (2022-02-23T14:41:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.