Bridging Vision and Language Encoders: Parameter-Efficient Tuning for
Referring Image Segmentation
- URL: http://arxiv.org/abs/2307.11545v1
- Date: Fri, 21 Jul 2023 12:46:15 GMT
- Title: Bridging Vision and Language Encoders: Parameter-Efficient Tuning for
Referring Image Segmentation
- Authors: Zunnan Xu, Zhihong Chen, Yong Zhang, Yibing Song, Xiang Wan, Guanbin
Li
- Abstract summary: We investigate parameter-efficient tuning for referring image segmentation.
We propose a novel adapter called Bridger to facilitate cross-modal information exchange.
We also design a lightweight decoder for image segmentation.
- Score: 72.27914940012423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Parameter Efficient Tuning (PET) has gained attention for reducing the number
of trainable parameters while maintaining performance and saving hardware
resources, but few studies investigate dense prediction tasks or the
interaction between modalities. In this paper, we investigate efficient
tuning for referring image segmentation. We propose a novel
adapter called Bridger to facilitate cross-modal information exchange and
inject task-specific information into the pre-trained model. We also design a
lightweight decoder for image segmentation. Our approach achieves comparable or
superior performance while updating only 1.61% to 3.38% of the backbone parameters,
evaluated on challenging benchmarks. The code is available at
https://github.com/kkakkkka/ETRIS.
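For intuition, the sketch below shows one way a bridging adapter of this kind could be wired: a small trainable bottleneck through which frozen vision and language tokens attend to each other, with the result injected back as residuals. All module names, dimensions, and the attention wiring here are illustrative assumptions, not the exact Bridger architecture.

```python
# Minimal sketch of a cross-modal bridging adapter (all names, dims, and
# wiring are assumptions for illustration, not the paper's exact design).
import torch
import torch.nn as nn

class BridgingAdapter(nn.Module):
    """A small trainable bottleneck that lets frozen vision and language
    token streams exchange information via cross-attention."""
    def __init__(self, vis_dim=768, lang_dim=512, bottleneck=64, heads=4):
        super().__init__()
        self.vis_down = nn.Linear(vis_dim, bottleneck)
        self.lang_down = nn.Linear(lang_dim, bottleneck)
        # one shared attention module serves both directions (a sketch choice)
        self.cross = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.vis_up = nn.Linear(bottleneck, vis_dim)
        self.lang_up = nn.Linear(bottleneck, lang_dim)

    def forward(self, vis_tokens, lang_tokens):
        v = self.vis_down(vis_tokens)      # (B, Nv, b)
        l = self.lang_down(lang_tokens)    # (B, Nl, b)
        v2, _ = self.cross(v, l, l)        # vision queries language
        l2, _ = self.cross(l, v, v)        # language queries vision
        # inject task-specific signal back into the frozen streams as residuals
        return vis_tokens + self.vis_up(v2), lang_tokens + self.lang_up(l2)

vis, lang = torch.randn(2, 196, 768), torch.randn(2, 20, 512)
v_out, l_out = BridgingAdapter()(vis, lang)
print(v_out.shape, l_out.shape)  # (2, 196, 768) (2, 20, 512)
```

Since only such adapters (plus a lightweight decoder) would be trained, the fraction of updated parameters stays in the low single digits, consistent with the 1.61% to 3.38% budget reported above.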
Related papers
- Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
Sparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead.
Our results show that Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-23T15:34:53Z)
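The core token-sparsification step can be sketched in a few lines: score the tokens, keep a fixed fraction at each layer, and the quadratic attention cost shrinks with the square of the kept fraction. The norm-based saliency below is a simple stand-in, not Sparse-Tuning's actual selection rule.

```python
# Illustrative per-layer token reduction (scoring rule is an assumption).
import torch

def keep_top_tokens(tokens: torch.Tensor, keep_ratio: float = 0.7) -> torch.Tensor:
    """Keep only the highest-scoring tokens; attention over the survivors
    costs roughly keep_ratio**2 of the original quadratic budget."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    scores = tokens.norm(dim=-1)                 # (B, N) saliency proxy
    idx = scores.topk(k, dim=1).indices          # kept tokens, score order
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))

x = torch.randn(2, 196, 768)
print(keep_top_tokens(x).shape)  # torch.Size([2, 137, 768])
```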
- Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis [51.14136878142034]
Point cloud analysis has achieved outstanding performance by transferring pre-trained point cloud models.
Existing methods for model adaptation usually update all model parameters, which is inefficient due to the high computational cost.
In this paper, we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency.
arXiv Detail & Related papers (2024-03-03T08:25:04Z)
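The parameter-efficiency setting described here follows a generic recipe: freeze the pre-trained backbone and mark only small inserted modules as trainable. The name-matching heuristic below is illustrative, not the paper's implementation.

```python
# Generic freeze-then-adapt recipe (the key names are an assumption).
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_keys=("adapter", "prompt")):
    """Freeze everything except parameters whose names contain a key."""
    total, trainable = 0, 0
    for name, p in model.named_parameters():
        p.requires_grad = any(k in name for k in trainable_keys)
        total += p.numel()
        trainable += p.numel() if p.requires_grad else 0
    print(f"trainable: {trainable / max(total, 1):.2%} of {total:,} parameters")
```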
- SPPNet: A Single-Point Prompt Network for Nuclei Image Segmentation [6.149725843029721]
A single-point prompt network is proposed for nuclei image segmentation.
We replace the original image encoder with a lightweight vision transformer.
The proposed model is evaluated on the MoNuSeg-2018 dataset.
arXiv Detail & Related papers (2023-08-23T16:13:58Z)
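In the simplest reading, a single-point prompt is a learned projection of the click coordinates into one prompt token. The sketch below is an assumed mechanism for illustration, not SPPNet's actual prompt encoder.

```python
# Hypothetical single-click prompt encoder (illustrative assumption).
import torch
import torch.nn as nn

class PointPrompt(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(2, dim)  # (x, y) in [0, 1] -> prompt embedding

    def forward(self, xy: torch.Tensor) -> torch.Tensor:
        # one prompt token per clicked nucleus
        return self.proj(xy).unsqueeze(1)  # (B, 1, dim)

print(PointPrompt()(torch.tensor([[0.4, 0.6]])).shape)  # torch.Size([1, 1, 256])
```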
- Ray-Patch: An Efficient Querying for Light Field Transformers [10.859910783551937]
We propose Ray-Patch querying, a novel model to efficiently query transformers to decode implicit representations into target views.
Our Ray-Patch decoding reduces the computational footprint and increases inference speed by up to an order of magnitude compared to previous models.
arXiv Detail & Related papers (2023-05-16T16:03:27Z)
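The cost saving of patch-level querying is easy to see in code: decode one query per k x k patch and fold the result back to pixels, so the transformer answers p^2 times fewer queries. The decoder below is a minimal stand-in, not the paper's Ray-Patch module.

```python
# Patch-query decoding sketch: one query per patch, folded back to pixels.
import torch
import torch.nn as nn

class PatchDecoder(nn.Module):
    def __init__(self, dim=256, patch=8, out_ch=3):
        super().__init__()
        self.patch = patch
        self.to_pixels = nn.Linear(dim, out_ch * patch * patch)

    def forward(self, queries, h, w):
        # queries: (B, (h//p)*(w//p), dim) -- p^2 fewer than pixel queries
        p = self.patch
        x = self.to_pixels(queries).transpose(1, 2)       # (B, C*p*p, L)
        return nn.functional.fold(x, (h, w), kernel_size=p, stride=p)

q = torch.randn(1, (64 // 8) * (64 // 8), 256)
print(PatchDecoder()(q, 64, 64).shape)  # torch.Size([1, 3, 64, 64])
```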
- Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose Aurora, a graceful prompt framework for cross-modal transfer, to overcome these challenges.
Considering the redundancy in existing architectures, we first utilize mode approximation to generate only 0.1M trainable parameters for multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it outperforms not only the state of the art but even full fine-tuning.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
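A rank-r factorized weight update gives a feel for how a ~0.1M trainable-parameter budget can arise; the paper's mode approximation operates on a multimodal weight tensor, so the matrix version below is a simplified assumption.

```python
# Low-rank update around a frozen layer (simplified stand-in for the
# paper's mode approximation; rank and wiring are assumptions).
import torch
import torch.nn as nn

class LowRankTunedLinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # frozen pre-trained weight
        self.u = nn.Parameter(torch.zeros(base.out_features, rank))
        self.v = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)

    def forward(self, x):
        # frozen path plus a trainable rank-r correction
        return self.base(x) + x @ (self.u @ self.v).T

layer = LowRankTunedLinear(nn.Linear(768, 768))
n = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(n)  # 12288 trainable vs. 590592 frozen
```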
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BraTS, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
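The paired spatial/channel idea can be sketched as two branches over the same token sequence: standard attention across tokens plus a similarity map across feature channels. The real EPA block shares projections between branches and differs in detail.

```python
# Sketch of paired spatial + channel attention (illustrative only).
import torch
import torch.nn as nn

class PairedAttention(nn.Module):
    def __init__(self, dim=96, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_proj = nn.Linear(dim, dim)

    def forward(self, x):                          # x: (B, N, C)
        s, _ = self.spatial(x, x, x)               # attention across N tokens
        xc = x.transpose(1, 2)                     # (B, C, N)
        # channel branch: similarity across the C feature channels
        attn = torch.softmax(xc @ xc.transpose(1, 2) / x.shape[1] ** 0.5, -1)
        c = (attn @ xc).transpose(1, 2)            # back to (B, N, C)
        return s + self.channel_proj(c)

print(PairedAttention()(torch.randn(2, 512, 96)).shape)  # (2, 512, 96)
```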
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
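A minimal version of cluster-then-aggregate: one k-means assignment step pools the key/value tokens into M cluster tokens, dropping attention cost from O(N^2) to O(NM). The paper's clustering is more elaborate; this is only a sketch.

```python
# Content-based key/value pooling via a single k-means assignment step.
import torch
import torch.nn.functional as F

def cluster_tokens(kv: torch.Tensor, m: int = 49) -> torch.Tensor:
    B, N, D = kv.shape
    centroids = kv[:, torch.randperm(N)[:m]]           # (B, M, D) random init
    assign = torch.cdist(kv, centroids).argmin(-1)     # (B, N) nearest centroid
    onehot = F.one_hot(assign, m).float()              # (B, N, M)
    counts = onehot.sum(1).clamp(min=1).unsqueeze(-1)  # (B, M, 1)
    return (onehot.transpose(1, 2) @ kv) / counts      # mean-pooled clusters

x = torch.randn(2, 196, 64)
print(cluster_tokens(x).shape)  # torch.Size([2, 49, 64])
```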
- Evaluation of Dirichlet Process Gaussian Mixtures for Segmentation on Noisy Hyperspectral Images [1.4721615285883425]
This paper proposes and evaluates a method for segmentation of Hyperspectral Images using the Dirichlet Process Gaussian Mixture Model.
Our model can self-regulate its parameters until it finds the optimal scale and number of clusters in a given dataset.
Results demonstrate the potential of our method to find objects in a hyperspectral image while bypassing the burden of manually searching for optimal parameters.
arXiv Detail & Related papers (2022-03-05T21:44:52Z)
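scikit-learn's variational Bayesian GMM with a Dirichlet-process prior reproduces the key property described: the model is given only an upper bound on the number of components and prunes the rest. The data below is synthetic; the paper's pipeline and preprocessing are not shown.

```python
# Dirichlet-process GMM over per-pixel spectra (synthetic stand-in data).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

h, w, bands = 32, 32, 10
pixels = np.random.rand(h * w, bands)        # stand-in hyperspectral cube

dpgmm = BayesianGaussianMixture(
    n_components=20,                         # upper bound, not a fixed K
    weight_concentration_prior_type="dirichlet_process",
    max_iter=200,
)
labels = dpgmm.fit_predict(pixels).reshape(h, w)
print(np.unique(labels).size, "clusters used out of 20 allowed")
```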
- ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning [13.659124860884912]
We present three methods for image captioning model reduction.
Our proposed ACORT models have 3.7x to 21.6x fewer parameters than the baseline model.
Results demonstrate that our ACORT models are competitive against baselines and SOTA approaches.
arXiv Detail & Related papers (2022-02-11T05:10:28Z)
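The summary does not name the three reduction methods, so the sketch below shows one standard technique consistent with such parameter ratios, cross-layer weight sharing, where a single transformer layer is reused across depth. It is an assumption for illustration, not ACORT's actual design.

```python
# Cross-layer weight sharing: one layer's parameters, applied `depth` times.
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, dim=512, depth=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.depth = depth  # one set of weights, reused at every depth step

    def forward(self, x):
        for _ in range(self.depth):
            x = self.layer(x)
        return x

m = SharedLayerEncoder()
print(sum(p.numel() for p in m.parameters()))  # ~1/6 of an untied 6-layer stack
```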
- Few-Shot Segmentation via Cycle-Consistent Transformer [74.49307213431952]
We focus on utilizing pixel-wise relationships between support and target images to facilitate the few-shot semantic segmentation task.
We propose a novel cycle-consistent attention mechanism to filter out possibly harmful support features.
Our proposed CyCTR leads to remarkable improvement compared to previous state-of-the-art methods.
arXiv Detail & Related papers (2021-06-04T07:57:48Z)
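The cycle-consistency idea can be sketched as a filter on the cross-attention affinity: a support token contributes only if its own best-matching query position carries the same label. Everything below (the synthetic labels, the -1e4 masking) is illustrative, not CyCTR's exact formulation.

```python
# Cycle-consistency filter on a query-to-support affinity matrix.
import torch

def cycle_consistent_mask(affinity, q_labels, s_labels):
    # affinity: (Nq, Ns) query-to-support similarity
    back = affinity.argmax(dim=0)            # best query index per support token
    consistent = q_labels[back] == s_labels  # does the cycle land on same label?
    # suppress inconsistent support tokens before the softmax
    return affinity.masked_fill(~consistent.unsqueeze(0), -1e4)

torch.manual_seed(0)
aff = torch.randn(4, 6)
masked = cycle_consistent_mask(aff, torch.tensor([0, 1, 0, 1]),
                               torch.tensor([0, 0, 1, 1, 0, 1]))
print(torch.softmax(masked, dim=-1))  # inconsistent columns get ~zero weight
```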
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.