Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
- URL: http://arxiv.org/abs/2412.06458v1
- Date: Mon, 09 Dec 2024 13:02:35 GMT
- Title: Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
- Authors: Wei Suo, Ji Ma, Mengyang Sun, Lin Yuanbo Wu, Peng Wang, Yanning Zhang,
- Abstract summary: We propose a novel framework for inference acceleration called the Pruning All-Rounder (PAR)
With a self-supervised learning manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios.
- Score: 42.124670377223175
- License:
- Abstract: Although Large Vision-Language Models (LVLMs) have achieved impressive results, their high computational cost poses a significant barrier to wider application. To enhance inference efficiency, most existing approaches depend on parameter-dependent or token-dependent strategies to reduce computational demands. However, these methods typically require complex training processes and struggle to consistently select the most relevant tokens. In this paper, we systematically analyze the above challenges and provide a series of valuable insights for inference acceleration. Based on these findings, we propose a novel framework, the Pruning All-Rounder (PAR). Different from previous works, PAR develops a meta-router to adaptively organize pruning flows across both tokens and layers. With a self-supervised learning manner, our method achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of pruning scenarios. The code for this work will be made publicly available.
Related papers
- Bag of Tricks for Inference-time Computation of LLM Reasoning [10.366475014241407]
We investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity.
Our ablation studies reveal that previously overlooked strategies can significantly enhance performance.
We establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks.
arXiv Detail & Related papers (2025-02-11T02:31:11Z) - FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws.
We propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens.
Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
arXiv Detail & Related papers (2024-12-16T07:09:46Z) - DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization [61.492590008258986]
Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs.
We propose DRPruning, which incorporates distributionally robust optimization to restore balanced performance across domains.
arXiv Detail & Related papers (2024-11-21T12:02:39Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - QPO: Query-dependent Prompt Optimization via Multi-Loop Offline Reinforcement Learning [58.767866109043055]
We introduce Query-dependent Prompt Optimization (QPO), which iteratively fine-tune a small pretrained language model to generate optimal prompts tailored to the input queries.
We derive insights from offline prompting demonstration data, which already exists in large quantities as a by-product of benchmarking diverse prompts on open-sourced tasks.
Experiments on various LLM scales and diverse NLP and math tasks demonstrate the efficacy and cost-efficiency of our method in both zero-shot and few-shot scenarios.
arXiv Detail & Related papers (2024-08-20T03:06:48Z) - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL)
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z) - Less is KEN: a Universal and Simple Non-Parametric Pruning Algorithm for Large Language Models [1.5807079236265718]
KEN is a straightforward, universal and unstructured pruning algorithm based on Kernel Density Estimation (KDE)
Ken aims to construct optimized transformers by selectively preserving the most significant parameters while restoring others to their pre-training state.
Ken achieves equal or better performance than their original unpruned versions, with a minimum parameter reduction of 25%.
arXiv Detail & Related papers (2024-02-05T16:11:43Z) - Tree-of-Mixed-Thought: Combining Fast and Slow Thinking for Multi-hop
Visual Reasoning [16.495754104540605]
Large language models (LLMs) can generate code-like plans for complex inference tasks such as visual reasoning.
We propose a hierarchical plan-searching algorithm that integrates the one-stop reasoning (fast) and the Tree-of-thought (slow)
arXiv Detail & Related papers (2023-08-18T16:21:40Z) - Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers [29.319666323947708]
We present a novel approach that dynamically prunes contextual information while preserving the model's expressiveness.
Our method employs a learnable mechanism that determines which uninformative tokens can be dropped from the context.
Our reference implementation achieves up to $2times$ increase in inference throughput and even greater memory savings.
arXiv Detail & Related papers (2023-05-25T07:39:41Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.