Improving SAM for Camouflaged Object Detection via Dual Stream Adapters
- URL: http://arxiv.org/abs/2503.06042v2
- Date: Fri, 28 Mar 2025 03:12:15 GMT
- Title: Improving SAM for Camouflaged Object Detection via Dual Stream Adapters
- Authors: Jiaming Liu, Linghe Kong, Guihai Chen
- Abstract summary: Segment anything model (SAM) has shown impressive general-purpose segmentation performance on natural images. We propose SAM-COD that performs camouflaged object detection for RGB-D inputs.
- Score: 48.14077145912842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The segment anything model (SAM) has shown impressive general-purpose segmentation performance on natural images, but its performance on camouflaged object detection (COD) is unsatisfactory. In this paper, we propose SAM-COD, which performs camouflaged object detection on RGB-D inputs. While keeping the SAM architecture intact, we add dual-stream adapters to the image encoder to learn complementary information from RGB images and depth images, and fine-tune the mask decoder and its depth replica to perform dual-stream mask prediction. In practice, the dual-stream adapters are embedded into the attention blocks of the image encoder in parallel to facilitate the refinement and correction of the two types of image embeddings. To mitigate the channel discrepancies arising from dual-stream embeddings that do not directly interact with each other, we strengthen their association using bidirectional knowledge distillation, comprising a model distiller and a modal distiller. In addition, to predict masks from the RGB and depth attention maps, we hybridize the two types of image embeddings, jointly learn them with the prompt embeddings to update the initial prompt, and then feed them into the mask decoders to keep the image embeddings and prompt embeddings consistent. Experimental results on four COD benchmarks show that SAM-COD achieves substantial detection gains over SAM and reaches state-of-the-art results under the same fine-tuning paradigm.
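The core mechanism, lightweight adapters running in parallel with SAM's frozen attention blocks, one per modality, can be pictured with the short PyTorch sketch below. It is illustrative only: the bottleneck design, dimensions, and names such as `DualStreamAdapterBlock` are our assumptions, not the authors' released code.

```python
# Minimal sketch of dual-stream adapters in parallel with a frozen
# attention block. Names, bottleneck design, and dimensions are
# assumptions; see the authors' code for the real implementation.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x)))

class DualStreamAdapterBlock(nn.Module):
    """A frozen self-attention layer with one trainable adapter per
    stream, applied in parallel to the attention branch."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        for p in self.attn.parameters():   # frozen SAM weights
            p.requires_grad = False
        self.rgb_adapter = Adapter(dim)    # trainable
        self.depth_adapter = Adapter(dim)  # trainable

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        def branch(x: torch.Tensor, adapter: Adapter) -> torch.Tensor:
            h = self.norm(x)
            attn_out, _ = self.attn(h, h, h)
            # the adapter runs in parallel with the frozen attention path
            return x + attn_out + adapter(h)
        return branch(rgb, self.rgb_adapter), branch(depth, self.depth_adapter)

block = DualStreamAdapterBlock(dim=256)
rgb = torch.randn(2, 196, 256)    # (batch, tokens, dim)
depth = torch.randn(2, 196, 256)
rgb_out, depth_out = block(rgb, depth)
print(rgb_out.shape, depth_out.shape)
```

The bidirectional knowledge distillation described in the abstract would then add a pair of losses tying the RGB and depth embeddings together; that part is omitted from the sketch.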
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems.
Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner.
We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space.
We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z)
- CAT-SAM: Conditional Tuning for Few-Shot Adaptation of Segment Anything Model [90.26396410706857]
This paper presents CAT-SAM, a ConditionAl Tuning network that adapts SAM toward various unconventional target tasks.
CAT-SAM freezes the entire SAM and adapts its mask decoder and image encoder simultaneously with a small number of learnable parameters.
CAT-SAM variants consistently achieve superior target segmentation performance, even under the very challenging one-shot adaptation setup; a sketch of this freeze-and-adapt pattern follows below.
arXiv Detail & Related papers (2024-02-06T02:00:18Z)
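CAT-SAM's central move, freezing the full backbone and training only a small number of injected parameters, reduces to a generic freeze-and-adapt pattern like the one below. This is a hypothetical sketch with a stand-in backbone, not CAT-SAM's actual conditioning network.

```python
# Generic freeze-and-adapt pattern in the spirit of CAT-SAM: freeze
# the whole model, train only a tiny injected module. `TinyAdapter`
# and the injection point are illustrative assumptions.
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.proj(x)  # residual keeps the frozen path intact

# Stand-in for the frozen SAM image encoder.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4)
for p in backbone.parameters():   # freeze everything
    p.requires_grad = False
adapter = TinyAdapter(256)        # the only trainable module

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable: {trainable} / {total} ({100 * trainable / total:.2f}%)")

x = torch.randn(1, 196, 256)      # (batch, tokens, dim)
out = adapter(backbone(x))        # frozen features, lightly tuned
```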
- Mask-adaptive Gated Convolution and Bi-directional Progressive Fusion Network for Depth Completion [3.5940515868907164]
We propose a new model for depth completion based on an encoder-decoder structure. Our model introduces two key components: the Mask-adaptive Gated Convolution architecture and the Bi-directional Progressive Fusion module. We achieve remarkable performance in completing depth maps and outperform existing approaches in accuracy and reliability (see the gating sketch below).
arXiv Detail & Related papers (2024-01-15T02:58:06Z)
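The "mask-adaptive gated convolution" named above is not spelled out in this summary, but gated convolutions generally modulate features with a learned, input-dependent sigmoid gate. The sketch below shows that generic mechanism under our own assumptions (a 1-channel validity mask feeding the gate); the paper's exact formulation may differ.

```python
# Generic gated convolution: a sigmoid gate predicted from the input
# features plus a sparse-depth validity mask modulates the response.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        # the gate sees the features plus a 1-channel validity mask
        self.gate = nn.Conv2d(in_ch + 1, out_ch, k, padding=k // 2)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([x, mask], dim=1)))
        return self.feature(x) * g   # mask-dependent gating

conv = GatedConv2d(32, 64)
feats = torch.randn(1, 32, 64, 64)
valid = (torch.rand(1, 1, 64, 64) > 0.95).float()  # sparse depth mask
print(conv(feats, valid).shape)   # torch.Size([1, 64, 64, 64])
```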
- DVANet: Disentangling View and Action Features for Multi-View Action Recognition [56.283944756315066]
We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets.
arXiv Detail & Related papers (2023-12-10T01:19:48Z)
- Dual-Stream Attention Transformers for Sewer Defect Classification [2.5499055723658097]
We propose a dual-stream vision transformer architecture that processes RGB and optical flow inputs for efficient sewer defect classification.
Our key idea is to use self-attention regularization to harness the complementary strengths of the RGB and motion streams.
By leveraging motion cues through a self-attention regularizer, we align and enhance the RGB attention maps, enabling the network to concentrate on pertinent input regions (a sketch of such a regularizer follows below).
arXiv Detail & Related papers (2023-11-07T02:31:51Z)
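A self-attention regularizer of the kind described can be written as a loss pulling the RGB stream's attention maps toward the motion stream's. The MSE form below is our assumption, not necessarily the paper's exact objective.

```python
# Sketch of a cross-stream attention regularizer: an MSE term pulls
# RGB attention maps toward (detached) optical-flow attention maps.
import torch
import torch.nn.functional as F

def attention_maps(q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Softmax attention weights of shape (batch, tokens, tokens)."""
    scale = q.size(-1) ** -0.5
    return torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

def attn_regularizer(q_rgb, k_rgb, q_flow, k_flow) -> torch.Tensor:
    a_rgb = attention_maps(q_rgb, k_rgb)
    # flow maps act as targets, so gradients flow only into the RGB stream
    a_flow = attention_maps(q_flow, k_flow).detach()
    return F.mse_loss(a_rgb, a_flow)

b, n, d = 2, 49, 64
loss = attn_regularizer(torch.randn(b, n, d), torch.randn(b, n, d),
                        torch.randn(b, n, d), torch.randn(b, n, d))
print(loss.item())
```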
- DMDC: Dynamic-mask-based dual camera design for snapshot Hyperspectral Imaging [3.3946853660795884]
We present a dynamic-mask-based dual camera system, which consists of an RGB camera and a CASSI system running in parallel.
First, the system learns the spatial feature distribution of the scene based on the RGB images, then instructs the SLM to encode each scene, and finally sends both RGB and CASSI images to the network for reconstruction.
We further design DMDC-net, which consists of two separate networks: a small-scale CNN-based dynamic-mask network that adjusts the mask dynamically, and a multimodal reconstruction network that reconstructs from the RGB and CASSI measurements.
arXiv Detail & Related papers (2023-08-03T05:10:58Z)
- Dual-view Snapshot Compressive Imaging via Optical Flow Aided Recurrent Neural Network [14.796204921975733]
Dual-view snapshot compressive imaging (SCI) aims to capture videos from two field-of-views (FoVs) in a single snapshot.
It is challenging for existing model-based decoding algorithms to reconstruct each individual scene.
We propose an optical flow-aided recurrent neural network for dual video SCI systems, which provides high-quality decoding in seconds.
arXiv Detail & Related papers (2021-09-11T14:24:44Z)
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation, which matters for many applications such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are then fused by effective residual-based fusion modules (see the sketch below).
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
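Residual-based fusion modules of this kind typically add a learned correction, computed from both modalities, back onto the main stream. The sketch below is a generic form under our assumptions, not necessarily PMF's exact module.

```python
# Generic residual fusion: the camera stream contributes a learned
# correction that is added residually onto the LiDAR features.
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1))

    def forward(self, lidar: torch.Tensor, camera: torch.Tensor) -> torch.Tensor:
        # main stream plus a correction computed from both modalities
        return lidar + self.fuse(torch.cat([lidar, camera], dim=1))

fusion = ResidualFusion(ch=64)
lidar_feat = torch.randn(1, 64, 32, 32)
cam_feat = torch.randn(1, 64, 32, 32)
print(fusion(lidar_feat, cam_feat).shape)  # torch.Size([1, 64, 32, 32])
```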
- Two-stream Encoder-Decoder Network for Localizing Image Forgeries [4.982505311411925]
We propose a novel two-stream encoder-decoder network, which utilizes both high-level and low-level image features.
We have carried out experimental analysis on multiple standard forensics datasets to evaluate the performance of the proposed method.
arXiv Detail & Related papers (2020-09-27T15:49:17Z)