Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
- URL: http://arxiv.org/abs/2312.04265v5
- Date: Thu, 18 Apr 2024 08:33:37 GMT
- Title: Stronger, Fewer, & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation
- Authors: Zhixiang Wei, Lin Chen, Yi Jin, Xiaoxiao Ma, Tianle Liu, Pengyang Ling, Ben Wang, Huaian Chen, Jinjin Zheng
- Abstract summary: We first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS).
We introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS.
With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning.
- Score: 16.12797115277337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Driven by the motivation of leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability, we introduce a robust fine-tuning approach, namely Rein, to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens, each linked to distinct instances, Rein precisely refines and forwards the feature maps from each layer to the next within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters, Rein efficiently fine-tunes VFMs for DGSS tasks, surprisingly surpassing full parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably, with just an extra 1% of trainable parameters within the frozen backbone, Rein achieves a mIoU of 78.4% on Cityscapes without accessing any real urban-scene datasets. Code is available at https://github.com/w1oves/Rein.git.
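To make the described mechanism concrete, here is a minimal PyTorch sketch of a per-layer refiner driven by trainable tokens. It illustrates the idea only; the class name, dimensions, and exact refinement rule are assumptions, and the repository linked above contains the authors' actual implementation.

```python
import torch
import torch.nn as nn

class ReinLayerSketch(nn.Module):
    """Hypothetical per-layer refiner: a small set of learnable tokens
    attends to the backbone's feature tokens and produces an additive
    refinement. Loosely modeled on the abstract; details differ."""

    def __init__(self, dim: int = 768, num_tokens: int = 100):
        super().__init__()
        # Learnable tokens, each intended to associate with distinct instances.
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) feature map from one frozen backbone layer.
        attn = torch.softmax(feats @ self.tokens.T / feats.shape[-1] ** 0.5, dim=-1)
        delta = self.proj(attn @ self.tokens)   # token-driven refinement
        return feats + delta                    # refined features feed the next layer

# Usage: one refiner would sit between consecutive layers of a frozen ViT.
refined = ReinLayerSketch()(torch.randn(2, 196, 768))
print(refined.shape)  # torch.Size([2, 196, 768])
```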
Related papers
- Rein++: Efficient Generalization and Adaptation for Semantic Segmentation with Vision Foundation Models [47.66611300605174]
Rein++ is an efficient VFM-based segmentation framework. It demonstrates superior generalization from limited data and enables effective adaptation to diverse unlabeled scenarios.
arXiv Detail & Related papers (2025-08-03T08:53:30Z) - Improving Progressive Generation with Decomposable Flow Matching [50.63174319509629]
Decomposable Flow Matching (DFM) is a simple and effective framework for the progressive generation of visual media. On ImageNet-1k 512px, DFM achieves a 35.2% improvement in FDD scores over the base architecture and 26.4% over the best-performing baseline.
arXiv Detail & Related papers (2025-06-24T17:58:02Z) - Dynamic Fisher-weighted Model Merging via Bayesian Optimization [37.02810891820468]
Existing merging approaches typically either scale the parameters model-wise or integrate parameter importance parameter-wise.
We unify these strategies into a more general merging framework, and introduce Dynamic Fisher-weighted Merging (DF-Merge).
We show that DF-Merge outperforms strong baselines across models of different sizes and a variety of tasks.
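The summary names the mechanism but not the formula; below is a hedged sketch of diagonal-Fisher-weighted parameter merging, where the per-model coefficients (fixed here for brevity) would in DF-Merge be searched by Bayesian optimization.

```python
import torch

def fisher_weighted_merge(params, fishers, coeffs):
    """Merge corresponding parameter tensors from several fine-tuned models.

    params:  list of tensors (same shape), one per model
    fishers: list of tensors, diagonal Fisher estimates per model
    coeffs:  list of floats, per-model scaling (searched by Bayesian
             optimization in DF-Merge; fixed here for illustration)
    """
    num = torch.zeros_like(params[0])
    den = torch.zeros_like(params[0])
    for p, f, c in zip(params, fishers, coeffs):
        w = c * f
        num += w * p
        den += w
    return num / den.clamp_min(1e-12)

# Toy usage with two "models" of one parameter tensor each.
p1, p2 = torch.randn(4, 4), torch.randn(4, 4)
f1, f2 = torch.rand(4, 4), torch.rand(4, 4)   # e.g., mean squared gradients
merged = fisher_weighted_merge([p1, p2], [f1, f2], [0.6, 0.4])
```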
arXiv Detail & Related papers (2025-04-26T18:31:14Z) - Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks.
By utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
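As a rough illustration of aggregating low-rank experts in a Kronecker-product space, consider the sketch below; the factor shapes and the merge into the frozen weight are assumptions based only on the summary.

```python
import torch

def alore_style_delta(out_dim=768, in_dim=768, experts=4, rank=4, o1=4, i1=4):
    """Hypothetical aggregated low-rank update:
    delta_W = sum_k kron(S_k, A_k @ B_k), where A_k @ B_k is a low-rank
    expert and S_k a small mixing factor (Kronecker 'hypercomplex' trick)."""
    o2, i2 = out_dim // o1, in_dim // i1
    delta = torch.zeros(out_dim, in_dim)
    for _ in range(experts):
        S = torch.randn(o1, i1) * 0.02        # small mixing factor
        A = torch.randn(o2, rank) * 0.02      # low-rank expert factors
        B = torch.randn(rank, i2) * 0.02
        delta += torch.kron(S, A @ B)
    return delta

W_frozen = torch.randn(768, 768)
W_merged = W_frozen + alore_style_delta()     # merged into the frozen weight,
print(W_merged.shape)                         # so inference adds no cost
```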
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Adapting Vision Foundation Models for Robust Cloud Segmentation in Remote Sensing Images [22.054023867495722]
Cloud segmentation is a critical challenge in remote sensing image interpretation.
We present a parameter-efficient adaptive approach, termed Cloud-Adapter, to enhance the accuracy and robustness of cloud segmentation.
arXiv Detail & Related papers (2024-11-20T08:37:39Z) - Selective Self-Rehearsal: A Fine-Tuning Approach to Improve Generalization in Large Language Models [19.752712857873043]
This paper introduces Selective Self-Rehearsal (SSR), a fine-tuning approach that achieves performance comparable to standard supervised fine-tuning (SFT).
By utilizing the model's correct responses, SSR reduces model specialization during the fine-tuning stage.
The effectiveness of SSR is demonstrated through experiments on the task of identifying unanswerable queries across various datasets.
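A minimal sketch of the selection step this implies: rehearse the model's own responses when they are already correct, and fall back to gold targets otherwise. The function names and correctness check are hypothetical.

```python
def build_ssr_dataset(examples, generate, is_correct):
    """Selective Self-Rehearsal data construction (sketch).

    examples:   list of (prompt, gold_answer) pairs
    generate:   fn(prompt) -> the model's current response
    is_correct: fn(response, gold) -> bool (e.g., exact match or a judge)
    """
    dataset = []
    for prompt, gold in examples:
        response = generate(prompt)
        # Rehearsing the model's own correct answer, rather than the gold
        # target, reduces over-specialization during fine-tuning.
        target = response if is_correct(response, gold) else gold
        dataset.append({"prompt": prompt, "target": target})
    return dataset

# Toy usage with stand-in callables.
data = build_ssr_dataset(
    [("2+2=?", "4")],
    generate=lambda p: "4",
    is_correct=lambda r, g: r.strip() == g,
)
```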
arXiv Detail & Related papers (2024-09-07T10:21:03Z) - No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations [30.9134119244757]
FUNGI is a method to enhance the features of transformer encoders by leveraging self-supervised gradients.
Our method is simple: given any pretrained model, we first compute gradients from various self-supervised objectives for each input.
The resulting features are evaluated on k-nearest neighbor classification over 11 datasets from vision, 5 from natural language processing, and 2 from audio.
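The recipe is stated directly, so a small sketch is possible: compute per-input gradients of a self-supervised-style objective and use them as extra features for k-NN. The toy objective below (entropy of a random probe head) is an assumption; the paper uses real self-supervised losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gradient_features(encoder, head, x):
    """Per-input gradient features from a self-supervised-style loss (sketch).
    encoder: frozen feature extractor; head: small probe whose parameter
    gradients serve as extra features for k-NN classification."""
    with torch.no_grad():
        feats = encoder(x)                        # (B, D) frozen embeddings
    grads = []
    for f in feats:
        head.zero_grad()
        logits = head(f)
        # Toy objective: entropy of the probe's output distribution.
        probs = F.softmax(logits, dim=-1)
        loss = -(probs * F.log_softmax(logits, dim=-1)).sum()
        loss.backward()
        grads.append(torch.cat([p.grad.flatten() for p in head.parameters()]))
    return torch.stack(grads)                     # concatenated with embeddings
                                                  # before running k-NN

encoder, head = nn.Linear(32, 16), nn.Linear(16, 8)  # stand-in modules
gf = gradient_features(encoder, head, torch.randn(4, 32))
print(gf.shape)                                   # torch.Size([4, 136])
```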
arXiv Detail & Related papers (2024-07-15T17:58:42Z) - TAIA: Large Language Models are Out-of-Distribution Data Learners [30.57872423927015]
We propose an effective inference-time intervention method: Training All parameters but Inferring with only Attention (TAIA).
TAIA achieves superior improvements compared to both the fully fine-tuned model and the base model in most scenarios.
The high tolerance of TAIA to data mismatches makes it resistant to jailbreaking tuning and enhances specialized tasks using general data.
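The method name encodes the mechanism: fine-tune all parameters, but at inference keep only the attention modules' updates. A sketch of that state-dict surgery follows; the name-matching rule is an assumption that depends on the model's parameter naming.

```python
def taia_inference_weights(base_state, tuned_state, attn_key="attn"):
    """Keep fine-tuned weights only for attention modules; revert all
    other parameters to the base model (Training All, Inferring with
    only Attention). Sketch over two state dicts with identical keys."""
    merged = {}
    for name, base_w in base_state.items():
        use_tuned = attn_key in name          # crude name-based selector
        merged[name] = tuned_state[name] if use_tuned else base_w
    return merged

# Usage (assuming base_sd and tuned_sd are matching state dicts):
# model.load_state_dict(taia_inference_weights(base_sd, tuned_sd))
```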
arXiv Detail & Related papers (2024-05-30T15:57:19Z) - Dynamic Pre-training: Towards Efficient and Scalable All-in-One Image Restoration [100.54419875604721]
All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation.
We propose DyNet, a dynamic family of networks designed in an encoder-decoder style for all-in-one image restoration tasks.
Our DyNet can seamlessly switch between its bulkier and lightweight variants, thereby offering flexibility for efficient model deployment.
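One plausible reading of this switching behavior is weight sharing across repeated blocks, so that depth becomes a run-time choice; the sketch below encodes that assumption and is not DyNet's actual architecture.

```python
import torch
import torch.nn as nn

class SharedDepthNet(nn.Module):
    """Reuses one block's weights a variable number of times, so a single
    checkpoint serves both a light (few passes) and a bulky (many passes)
    variant. Illustrative assumption, not DyNet's actual design."""

    def __init__(self, dim=64):
        super().__init__()
        self.block = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())

    def forward(self, x, repeats=4):
        for _ in range(repeats):              # repeats=1 -> lightweight variant
            x = x + self.block(x)             # larger repeats -> bulkier variant
        return x

out = SharedDepthNet()(torch.randn(1, 64, 32, 32), repeats=2)
```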
arXiv Detail & Related papers (2024-04-02T17:58:49Z) - Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models.
Existing methods for model adaptation usually update all model parameters, which is inefficient due to high computational costs.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z) - More than the Sum of Its Parts: Ensembling Backbone Networks for Few-Shot Segmentation [49.090592800481616]
We study whether fusing features from different backbones can improve the ability of few-shot segmentation (FSS) models to capture richer visual features.
We propose and compare two ensembling techniques: Independent Voting and Feature Fusion.
Our approach outperforms the original single-backbone PANet across standard benchmarks even in challenging one-shot learning scenarios.
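A minimal sketch of the second technique named above, Feature Fusion: concatenate features from two frozen backbones and project them to a common width (the projection layer and dimensions are assumptions).

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse features from two backbones by concatenation + projection."""

    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_a + dim_b, dim_out)

    def forward(self, feats_a, feats_b):
        return self.proj(torch.cat([feats_a, feats_b], dim=-1))

# Toy usage with two backbones of different widths.
fused = FeatureFusion(512, 768, 256)(torch.randn(4, 512), torch.randn(4, 768))
```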
arXiv Detail & Related papers (2024-02-09T18:01:15Z) - Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker of a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques.
Our Un-Track achieves a +8.1 absolute F-score gain on the DepthTrack dataset while introducing only +2.14 GFLOPs (over 21.50) and +6.6M parameters (over 93M).
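A hedged sketch of a "common latent space through low-rank factorization and reconstruction": every modality passes through one shared low-rank bottleneck and is reconstructed, so a single parameter set serves all inputs. Dimensions and the training loss are assumptions.

```python
import torch
import torch.nn as nn

class SharedLowRankSpace(nn.Module):
    """One low-rank bottleneck shared by all input modalities (sketch)."""

    def __init__(self, dim=256, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank)  # factorize into a shared latent
        self.up = nn.Linear(rank, dim)    # reconstruct back to feature space

    def forward(self, feats):
        z = self.down(feats)
        recon = self.up(z)
        return z, recon                   # a reconstruction loss would tie
                                          # modalities to the shared space

rgb, depth = torch.randn(4, 256), torch.randn(4, 256)
shared = SharedLowRankSpace()
z_rgb, _ = shared(rgb)
z_depth, _ = shared(depth)                # same parameters for any modality
```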
arXiv Detail & Related papers (2023-11-27T14:17:41Z) - Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online generated prototypes.
We demonstrate that the resulting neural network model is able to attenuate the gap between fully fine-tuned and parameter-efficiently adapted models.
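A rough, SwAV-style sketch of "assignments to online generated prototypes" (an assumption; the paper's exact scheme may differ): dense features are softly assigned to a learnable prototype bank, and the assignments serve as self-supervised targets.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

prototypes = nn.Linear(128, 64, bias=False)   # 64 learnable prototypes

def prototype_assignments(feats, temp=0.1):
    """Soft-assign L2-normalized dense features to prototypes (sketch)."""
    feats = F.normalize(feats, dim=-1)
    logits = prototypes(feats) / temp
    return F.softmax(logits, dim=-1)          # used as self-supervised targets

dense_feats = torch.randn(2, 196, 128)        # (B, tokens, dim)
targets = prototype_assignments(dense_feats)
```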
arXiv Detail & Related papers (2022-11-16T21:55:05Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
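A hedged sketch of the sparsity prior on weight updates: keep only the largest-magnitude entries of the fine-tuning delta. The top-k selection rule is an assumption, and DSEE additionally combines sparsity with low-rank structure.

```python
import torch

def sparsify_update(delta: torch.Tensor, keep_ratio: float = 0.01):
    """Keep only the top-|keep_ratio| entries of a weight update (sketch)."""
    k = max(1, int(delta.numel() * keep_ratio))
    # Threshold at the k-th largest absolute value.
    thresh = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    mask = delta.abs() >= thresh
    return delta * mask                    # sparse delta applied to base weight

base = torch.randn(768, 768)
delta = torch.randn(768, 768) * 0.01
tuned = base + sparsify_update(delta)      # parameter-efficient update
```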
arXiv Detail & Related papers (2021-10-30T03:29:47Z) - Adaptive Recursive Circle Framework for Fine-grained Action Recognition [95.51097674917851]
How to model fine-grained spatial-temporal dynamics in videos has been a challenging problem for action recognition.
Most existing methods generate features of a layer in a pure feedforward manner.
We propose an Adaptive Recursive Circle framework, a fine-grained decorator for pure feedforward layers.
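As a sketch of decorating a pure feedforward layer with recursion, the wrapper below feeds the layer's output back into its own input for a few steps; the exact update rule is an assumption based on the summary.

```python
import torch
import torch.nn as nn

class RecursiveCircle(nn.Module):
    """Wraps a feedforward layer so its output recursively refines its own
    input for a few iterations (illustrative sketch)."""

    def __init__(self, layer: nn.Module, steps: int = 2):
        super().__init__()
        self.layer, self.steps = layer, steps

    def forward(self, x):
        y = self.layer(x)
        for _ in range(self.steps - 1):
            y = self.layer(x + y)          # previous estimate re-enters input
        return y

y = RecursiveCircle(nn.Linear(64, 64), steps=3)(torch.randn(2, 64))
```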
arXiv Detail & Related papers (2021-07-25T14:24:29Z)