Related papers: Transferable and Principled Efficiency for Open-Vocabulary Segmentation

URL: http://arxiv.org/abs/2404.07448v3
Date: Tue, 17 Sep 2024 03:21:01 GMT
Title: Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Authors: Jingxuan Xu, Wuyang Chen, Yao Zhao, Yunchao Wei
Abstract summary: The recent success of pre-trained vision-language foundation models makes Open-Vocabulary Segmentation (OVS) possible. This approach introduces heavy computational overhead from two sources: 1) the large model size of the backbone; 2) the expensive cost of fine-tuning. In this paper, we aim to achieve performance that is comparable to or even better than prior OVS works based on large vision-language foundation models, by utilizing smaller models that incur lower training costs.
Score: 82.66423763561697
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent success of pre-trained foundation vision-language models makes Open-Vocabulary Segmentation (OVS) possible. Despite the promising performance, this approach introduces heavy computational overheads for two challenges: 1) large model sizes of the backbone; 2) expensive costs during the fine-tuning. These challenges hinder this OVS strategy from being widely applicable and affordable in real-world scenarios. Although traditional methods such as model compression and efficient fine-tuning can address these challenges, they often rely on heuristics. This means that their solutions cannot be easily transferred and necessitate re-training on different models, which comes at a cost. In the context of efficient OVS, we target achieving performance that is comparable to or even better than prior OVS works based on large vision-language foundation models, by utilizing smaller models that incur lower training costs. The core strategy is to make our efficiency principled and thus seamlessly transferable from one OVS framework to others without further customization. Comprehensive experiments on diverse OVS benchmarks demonstrate our superior trade-off between segmentation accuracy and computation costs over previous works. Our code is available on https://github.com/Xujxyang/OpenTrans

Related papers

FREE: Fast and Robust Vision Language Models with Early Exits [5.402030962296633]
We introduce FREE, an adversarial training approach within a GAN-based framework. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. We experimentally validate that our method speeds up the inference process by more than 1.51x while retaining comparable performance.
arXiv Detail & Related papers (2025-06-07T18:26:58Z)
DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding [26.39397960987363]
We propose a simple modification to pretrained transformer models. Instead of concatenation with the language prompt at the start, we insert multimodal tokens directly into the middle. Our results indicate that our method reduces computational costs during both training and inference.
arXiv Detail & Related papers (2025-04-27T18:56:26Z)
Learning Free Token Reduction for Multi-Modal Large Language Models [3.4026156483879517]
Vision-Language Models (VLMs) have achieved remarkable success across a range of multimodal tasks. However, their practical deployment is often constrained by high computational costs and prolonged inference times. We propose a token compression paradigm that operates on both spatial and temporal dimensions.
arXiv Detail & Related papers (2025-01-29T02:52:32Z)
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models [42.124670377223175]
We propose a novel framework for inference acceleration called the Pruning All-Rounder (PAR). Trained in a self-supervised manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios.
arXiv Detail & Related papers (2024-12-09T13:02:35Z)
Exploiting Distribution Constraints for Scalable and Efficient Image Retrieval [1.6874375111244329]
State-of-the-art image retrieval systems train specific neural networks for each dataset. Off-the-shelf foundation models fall short in achieving performance comparable to dataset-specific models. We introduce Autoencoders with Strong Variance Constraints (AE-SVC), which significantly improves the performance of foundation models.
arXiv Detail & Related papers (2024-10-09T16:05:16Z)
Agreement-Based Cascading for Efficient Inference [32.914852531806]
Agreement-Based Cascading (ABC) is a simple, effective adaptive inference technique. ABC builds a cascade of models of increasing size/complexity, and uses agreement between ensembles of models at each level of the cascade as a basis for data-dependent routing. We show that ABC can reliably act as a drop-in replacement for existing models and surpass the best single model it aims to replace in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2024-07-02T15:14:12Z)
Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
The decision centers on whether to use a large LLM with better performance or a smaller one with reduced costs. We propose a simpler solution: we use only the uncertainty of the generations of the small LLM as the decision criterion. Our experiments reveal that this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
arXiv Detail & Related papers (2024-05-03T14:38:59Z)
Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing. LLMs are extremely computationally expensive, even at inference time. We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception. Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning. ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models. Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z)
Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results. Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples. These frameworks still face the challenge of reduced generalization ability on unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.