Retrospective Memory for Camouflaged Object Detection
- URL: http://arxiv.org/abs/2506.15244v1
- Date: Wed, 18 Jun 2025 08:22:19 GMT
- Title: Retrospective Memory for Camouflaged Object Detection
- Authors: Chenxi Zhang, Jiayun Wu, Qing Zhang, Yazhe Zhai, Youwei Pang,
- Abstract summary: We propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference. In the recall stage, we propose a dynamic memory mechanism and an inference pattern reconstruction. Our RetroMem significantly outperforms existing state-of-the-art methods.
- Score: 18.604039107883317
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Camouflaged object detection (COD) primarily focuses on learning subtle yet discriminative representations from complex scenes. Existing methods predominantly follow the parametric feedforward architecture based on static visual representation modeling. However, they lack explicit mechanisms for acquiring historical context, limiting their adaptation and effectiveness in handling challenging camouflage scenes. In this paper, we propose a recall-augmented COD architecture, namely RetroMem, which dynamically modulates camouflage pattern perception and inference by integrating relevant historical knowledge into the process. Specifically, RetroMem employs a two-stage training paradigm consisting of a learning stage and a recall stage to construct, update, and utilize memory representations effectively. During the learning stage, we design a dense multi-scale adapter (DMA) to improve the pretrained encoder's capability to capture rich multi-scale visual information with very few trainable parameters, thereby providing foundational inferences. In the recall stage, we propose a dynamic memory mechanism (DMM) and an inference pattern reconstruction (IPR). These components fully leverage the latent relationships between learned knowledge and current sample context to reconstruct the inference of camouflage patterns, thereby significantly improving the model's understanding of camouflage scenes. Extensive experiments on several widely used datasets demonstrate that our RetroMem significantly outperforms existing state-of-the-art methods.
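To make the abstract's architecture more concrete, the sketch below (PyTorch-style Python) shows how a recall-augmented pipeline of this kind could be wired: a frozen encoder with a lightweight residual adapter tuned in the learning stage, and an attention-based read over a bank of stored prototypes that modulates features at inference. It is a minimal sketch under stated assumptions; the module names, sizes, and memory-read rule are illustrative and are not the authors' implementation of the DMA, DMM, or IPR components.

```python
# Minimal sketch (not the authors' code) of a recall-augmented COD pipeline:
# frozen backbone + lightweight adapter + attention read over a memory bank.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightAdapter(nn.Module):
    """Few trainable parameters added on top of a frozen backbone feature map."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Conv2d(dim, bottleneck, kernel_size=1)
        self.up = nn.Conv2d(bottleneck, dim, kernel_size=1)

    def forward(self, x):
        return x + self.up(F.relu(self.down(x)))  # residual adapter

class MemoryRead(nn.Module):
    """Attention read over a bank of stored prototypes (a stand-in for a dynamic memory)."""
    def __init__(self, dim, slots=128):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(slots, dim))  # learned/updatable slots

    def forward(self, feat):                       # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        q = feat.flatten(2).transpose(1, 2)        # (B, HW, C) queries
        attn = torch.softmax(q @ self.memory.t() / c ** 0.5, dim=-1)
        recalled = attn @ self.memory              # (B, HW, C) recalled knowledge
        return (q + recalled).transpose(1, 2).reshape(b, c, h, w)

class RecallAugmentedCOD(nn.Module):
    def __init__(self, backbone, dim=256):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():       # learning stage tunes only the adapter/head
            p.requires_grad_(False)
        self.adapter = LightweightAdapter(dim)
        self.memory = MemoryRead(dim)
        self.head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, images):
        feat = self.backbone(images)               # assumed (B, dim, H, W) feature map
        feat = self.adapter(feat)                  # learning stage: multi-scale adaptation
        feat = self.memory(feat)                   # recall stage: blend in stored knowledge
        return self.head(feat)                     # camouflage mask logits
```

In such a setup, the learning stage would update only the adapter and head, while the recall stage would additionally populate or refine the memory slots from encountered samples; both procedures are omitted here for brevity.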
Related papers
- Latent Structured Hopfield Network for Semantic Association and Retrieval [52.634915010996835]
Episodic memory enables humans to recall past experiences by associating semantic elements such as objects, locations, and time into coherent event representations. We propose the Latent Structured Hopfield Network (LSHN), a framework that integrates continuous Hopfield attractor dynamics into an autoencoder architecture. Unlike traditional Hopfield networks, our model is trained end-to-end with gradient descent, achieving scalable and robust memory retrieval.
arXiv Detail & Related papers (2025-06-02T04:24:36Z) - Latent Multimodal Reconstruction for Misinformation Detection [15.66049149213069]
"MisCaption This!" is a training dataset comprising LVLM-generated miscaptioned images.<n>"Latent Multimodal Reconstruction" (LAMAR) is a network trained to reconstruct the embeddings of truthful captions.<n>Experiments show that models trained on "MisCaption This!" better generalize on real-world misinformation.
arXiv Detail & Related papers (2025-04-08T13:16:48Z) - Multi-View Incremental Learning with Structured Hebbian Plasticity for Enhanced Fusion Efficiency [13.512920774125776]
The multi-view incremental framework MVIL aims at emulating the brain's fine-grained fusion of sequentially arriving views. At the core of MVIL lie two fundamental modules: structured Hebbian plasticity and synaptic partition learning. Experimental results on six benchmark datasets show MVIL's effectiveness over state-of-the-art methods.
arXiv Detail & Related papers (2024-12-17T11:10:46Z) - Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z) - MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring pretrained models to downstream tasks may suffer from task discrepancy, since pretraining is formulated as an image classification or object discrimination task.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - A Framework for Inference Inspired by Human Memory Mechanisms [9.408704431898279]
We propose a PMI framework that consists of perception, memory and inference components.
The memory module comprises working and long-term memory, with the latter endowed with a higher-order structure to retain extensive and complex relational knowledge and experience.
We apply our PMI framework to improve prevailing Transformer and CNN models on question-answering tasks such as the bAbI-20k and Sort-of-CLEVR datasets.
arXiv Detail & Related papers (2023-10-01T08:12:55Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - Improving Image Recognition by Retrieving from Web-Scale Image-Text Data [68.63453336523318]
We introduce an attention-based memory module, which learns the importance of each retrieved example from the memory.
Compared to existing approaches, our method removes the influence of the irrelevant retrieved examples, and retains those that are beneficial to the input query.
We show that it achieves state-of-the-art accuracy on the ImageNet-LT, Places-LT, and WebVision datasets.
arXiv Detail & Related papers (2023-04-11T12:12:05Z) - Reconstruction-guided attention improves the robustness and shape
processing of neural networks [5.156484100374057]
We build an iterative encoder-decoder network that generates an object reconstruction and uses it as top-down attentional feedback.
Our model shows strong generalization performance against various image perturbations.
Our study shows that modeling reconstruction-based feedback endows AI systems with a powerful attention mechanism.
arXiv Detail & Related papers (2022-09-27T18:32:22Z) - Memory-augmented Dense Predictive Coding for Video Representation
Learning [103.69904379356413]
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for self-supervised video representation learning.
We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both.
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)