No Re-Train, More Gain: Upgrading Backbones with Diffusion model for Pixel-Wise and Weakly-Supervised Few-Shot Segmentation
- URL: http://arxiv.org/abs/2407.16182v2
- Date: Mon, 07 Apr 2025 12:39:44 GMT
- Title: No Re-Train, More Gain: Upgrading Backbones with Diffusion model for Pixel-Wise and Weakly-Supervised Few-Shot Segmentation
- Authors: Shuai Chen, Fanman Meng, Chenhao Wu, Haoran Wei, Runtong Zhang, Qingbo Wu, Linfeng Xu, Hongliang Li
- Abstract summary: Few-Shot Segmentation (FSS) aims to segment novel classes using only a few annotated images. Current FSS methods face three issues: the inflexibility of backbone upgrades without re-training, the inability to uniformly handle various types of annotations, and the difficulty in accommodating different annotation quantities. We propose DiffUp, a novel framework that conceptualizes the FSS task as a conditional generative problem using a diffusion process.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-Shot Segmentation (FSS) aims to segment novel classes using only a few annotated images. Despite considerable progress under pixel-wise support annotation, current FSS methods still face three issues: the inflexibility of backbone upgrade without re-training, the inability to uniformly handle various types of annotations (e.g., scribble, bounding box, mask, and text), and the difficulty in accommodating different annotation quantities. To address these issues simultaneously, we propose DiffUp, a novel framework that conceptualizes the FSS task as a conditional generative problem using a diffusion process. For the first issue, we introduce a backbone-agnostic feature transformation module that converts different segmentation cues into unified coarse priors, facilitating seamless backbone upgrade without re-training. For the second issue, due to the varying granularity of transformed priors from diverse annotation types (scribble, bounding box, mask, and text), we conceptualize these multi-granular transformed priors as analogous to noisy intermediates at different steps of a diffusion model. This is implemented via a self-conditioned modulation block coupled with a dual-level quality modulation branch. For the third issue, we incorporate an uncertainty-aware information fusion module to harmonize the variability across zero-shot, one-shot, and many-shot scenarios. Evaluated through rigorous benchmarks, DiffUp significantly outperforms existing FSS models in terms of flexibility and accuracy.
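The core idea in the abstract, that coarse priors obtained from different annotation types (mask, scribble, bounding box, text) can be treated as noisy intermediates at different steps of a diffusion process, can be illustrated with a short self-contained sketch. This is a minimal, hypothetical PyTorch example written for intuition only; the module names, the granularity-to-timestep mapping, and all tensor shapes are assumptions rather than the authors' implementation.

```python
# Toy sketch (assumed names/shapes): coarse priors from different annotation types
# are treated like noisy intermediates at different diffusion timesteps and refined
# by a small conditional denoiser driven by query features from any backbone.
import torch
import torch.nn as nn

# Hypothetical mapping: finer annotation types behave like later (less noisy) steps.
GRANULARITY_TO_STEP = {"mask": 100, "scribble": 400, "box": 700, "text": 900}

class CoarsePriorDenoiser(nn.Module):
    """Toy conditional denoiser over a single-channel coarse prior map."""
    def __init__(self, feat_dim=64, num_steps=1000):
        super().__init__()
        self.step_embed = nn.Embedding(num_steps, feat_dim)
        self.net = nn.Sequential(
            nn.Conv2d(1 + feat_dim, feat_dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(feat_dim, 1, 3, padding=1),
        )

    def forward(self, prior, query_feat, step):
        # Stand-in for self-conditioned modulation: scale query features by a step embedding.
        s = self.step_embed(step)[:, :, None, None]
        cond = query_feat * (1 + s)
        return self.net(torch.cat([prior, cond], dim=1))

def refine(prior, query_feat, annotation_type, denoiser, refine_iters=3):
    """Iteratively refine a backbone-agnostic coarse prior into segmentation logits."""
    step = torch.full((prior.shape[0],), GRANULARITY_TO_STEP[annotation_type],
                      dtype=torch.long)
    x = prior
    for _ in range(refine_iters):
        x = x + denoiser(x, query_feat, step)  # residual denoising update
    return x

# Usage: a 64-dim query feature map and a box-derived coarse prior at 32x32 resolution.
denoiser = CoarsePriorDenoiser(feat_dim=64)
query_feat = torch.randn(1, 64, 32, 32)
coarse_prior = torch.rand(1, 1, 32, 32)
logits = refine(coarse_prior, query_feat, "box", denoiser)
print(logits.shape)  # torch.Size([1, 1, 32, 32])
```

Under this reading, coarser cues such as a bounding box or a text prompt are assigned earlier (noisier) timesteps and receive more refinement, while a full support mask enters as an almost-clean intermediate; the query features act only as the generative condition, which is in the spirit of the paper's backbone-agnostic design.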
Related papers
- MoFu: Scale-Aware Modulation and Fourier Fusion for Multi-Subject Video Generation [48.45457225939052]
MoFu is a unified framework that tackles scale inconsistency and permutation sensitivity.
MoFu significantly outperforms existing methods in preserving natural scale, subject fidelity, and overall visual quality.
arXiv Detail & Related papers (2025-12-26T09:29:30Z)
- Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach [99.80480649258557]
DiTFuse is an instruction-driven framework that performs semantics-aware fusion within a single model.
Experiments on public IVIF, MFF, and MEF benchmarks confirm superior quantitative and qualitative performance, sharper textures, and better semantic retention.
arXiv Detail & Related papers (2025-12-08T05:04:54Z)
- Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization [50.5332987313297]
We propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module.
TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution.
In experiments on MS-COCO and three diffusion backbones, TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality.
arXiv Detail & Related papers (2025-11-25T00:42:09Z)
- Training-Free Multi-Style Fusion Through Reference-Based Adaptive Modulation [10.053310365345412]
Adaptive Multi-Style Fusion (AMSF) is a training-free framework that enables controllable fusion of multiple reference styles in diffusion models.
AMSF produces multi-style fusion results that consistently outperform state-of-the-art approaches.
These capabilities position AMSF as a practical step toward expressive multi-style generation in diffusion models.
arXiv Detail & Related papers (2025-09-23T03:47:59Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models.
MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Prompt as Free Lunch: Enhancing Diversity in Source-Free Cross-domain Few-shot Learning through Semantic-Guided Prompting [9.116108409344177]
The source-free cross-domain few-shot learning task aims to transfer pretrained models to target domains utilizing minimal samples.
We propose the SeGD-VPT framework, which is divided into two phases.
The first phase aims to increase feature diversity by adding diversity prompts to each support sample, thereby generating varied inputs and enhancing sample diversity.
arXiv Detail & Related papers (2024-12-01T11:00:38Z)
- CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset [26.056704438848985]
We propose a novel model, CCExpert, based on a new, advanced multimodal large model framework.
First, we design a difference-aware integration module to capture multi-scale differences between bi-temporal images.
Second, we construct a high-quality, diversified dataset called CC-Foundation, containing 200,000 image pairs and 1.2 million captions.
Finally, we employ a three-stage progressive training process to ensure the deep integration of the difference-aware integration module with the pretrained MLLM.
arXiv Detail & Related papers (2024-11-18T08:10:49Z)
- CoSIGN: Few-Step Guidance of ConSIstency Model to Solve General INverse Problems [3.3969056208620128]
We propose to push the boundary of inference steps to 1-2 NFEs while still maintaining high reconstruction quality.
Our method achieves new state-of-the-art in diffusion-based inverse problem solving.
arXiv Detail & Related papers (2024-07-17T15:57:50Z)
- Memory-guided Network with Uncertainty-based Feature Augmentation for Few-shot Semantic Segmentation [12.653336728447654]
We propose a class-shared memory (CSM) module consisting of a set of learnable memory vectors.
These memory vectors learn elemental object patterns from base classes during training whilst re-encoding query features during both training and inference.
We integrate CSM and UFA into representative FSS works, with experimental results on the widely-used PASCAL-5$^i$ and COCO-20$^i$ datasets.
arXiv Detail & Related papers (2024-06-01T19:53:25Z)
- MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration [7.087475633143941]
MM-Diff is a tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds.
MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings.
CLS embeddings are used both to augment the text embeddings and, together with the patch embeddings, to derive a small number of detail-rich subject embeddings.
arXiv Detail & Related papers (2024-03-22T09:32:31Z)
- One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls [77.42510898755037]
One More Step (OMS) is a compact network that incorporates an additional simple yet effective step during inference.
OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters.
Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
arXiv Detail & Related papers (2023-11-27T12:02:42Z)
- Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption [73.98706049140098]
We propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss.
Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large.
Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation.
arXiv Detail & Related papers (2023-09-07T14:14:11Z)
- Improving Misaligned Multi-modality Image Fusion with One-stage Progressive Dense Registration [67.23451452670282]
Misalignments between multi-modality images pose challenges in image fusion.
We propose a Cross-modality Multi-scale Progressive Dense Registration scheme.
This scheme accomplishes the coarse-to-fine registration exclusively using a one-stage optimization.
arXiv Detail & Related papers (2023-08-22T03:46:24Z)
- Exploring Multi-Timestep Multi-Stage Diffusion Features for Hyperspectral Image Classification [16.724299091453844]
Diffusion-based HSI classification methods only utilize manually selected single-timestep single-stage features.
We propose MTMSD, a novel diffusion-based feature learning framework that, for the first time, explores Multi-Timestep Multi-Stage Diffusion features for HSI classification.
Our method outperforms state-of-the-art methods for HSI classification, especially on the challenging Houston 2018 dataset.
arXiv Detail & Related papers (2023-06-15T08:56:58Z)
- Diffusion Visual Counterfactual Explanations [51.077318228247925]
Visual Counterfactual Explanations (VCEs) are an important tool for understanding the decisions of an image classifier.
Current approaches for the generation of VCEs are restricted to adversarially robust models and often contain non-realistic artefacts.
In this paper, we overcome this by generating Diffusion Visual Counterfactual Explanations (DVCEs) for arbitrary ImageNet classifiers.
arXiv Detail & Related papers (2022-10-21T09:35:47Z)
- f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation [56.04628143914542]
Diffusion models (DMs) have recently emerged as SoTA tools for generative modeling in various domains.
We propose f-DM, a generalized family of DMs which allows progressive signal transformation.
We apply f-DM in image generation tasks with a range of functions, including down-sampling, blurring, and learned transformations.
arXiv Detail & Related papers (2022-10-10T18:49:25Z)
- Progressive Multi-scale Consistent Network for Multi-class Fundus Lesion Segmentation [28.58972084293778]
We propose a progressive multi-scale consistent network (PMCNet) that integrates the proposed progressive feature fusion (PFF) block and dynamic attention block (DAB).
PFF block progressively integrates multi-scale features from adjacent encoding layers, facilitating feature learning of each layer by aggregating fine-grained details and high-level semantics.
DAB is designed to dynamically learn the attentive cues from the fused features at different scales, thus aiming to smooth the essential conflicts existing in multi-scale features.
arXiv Detail & Related papers (2022-05-31T12:10:01Z)
- Exploring Complementary Strengths of Invariant and Equivariant Representations for Few-Shot Learning [96.75889543560497]
In many real-world problems, collecting a large number of labeled samples is infeasible.
Few-shot learning is the dominant approach to address this issue, where the objective is to quickly adapt to novel categories in the presence of a limited number of samples.
We propose a novel training mechanism that simultaneously enforces equivariance and invariance to a general set of geometric transformations.
arXiv Detail & Related papers (2021-03-01T21:14:33Z)
- Recurrent Multi-view Alignment Network for Unsupervised Surface Registration [79.72086524370819]
Learning non-rigid registration in an end-to-end manner is challenging due to the inherent high degrees of freedom and the lack of labeled training data.
We propose to represent the non-rigid transformation with a point-wise combination of several rigid transformations.
We also introduce a differentiable loss function that measures the 3D shape similarity on the projected multi-view 2D depth images.
arXiv Detail & Related papers (2020-11-24T14:22:42Z)
- Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results.
Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples.
These frameworks still face the challenge of reduced generalization ability on unseen classes due to the inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z)