Related papers: ETS: Efficient Tree Search for Inference-Time Scaling

URL: http://arxiv.org/abs/2502.13575v1
Date: Wed, 19 Feb 2025 09:30:38 GMT
Title: ETS: Efficient Tree Search for Inference-Time Scaling
Authors: Coleman Hooper, Sehoon Kim, Suhong Moon, Kerem Dilmen, Monishwaran Maheswaran, Nicholas Lee, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, Amir Gholami
Abstract summary: One promising approach for test-time compute scaling is search against a process reward model. The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration. We propose Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining necessary diverse trajectories.
Score: 61.553681244572914
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time compute scaling has emerged as a new axis along which to improve model accuracy, where additional computation is used at inference time to allow the model to think longer for more challenging problems. One promising approach for test-time compute scaling is search against a process reward model, where a model generates multiple potential candidates at each step of the search, and these partial trajectories are then scored by a separate reward model in order to guide the search process. The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration. However, this diversity comes at a cost, as divergent trajectories have less KV sharing, which means they consume more memory and slow down the search process. Previous search methods either do not perform sufficient exploration, or else explore diverse trajectories but have high latency. We address this challenge by proposing Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining necessary diverse trajectories. ETS incorporates a linear programming cost model to promote KV cache sharing by penalizing the number of nodes retained, while incorporating a semantic coverage term into the cost model to ensure that we retain trajectories which are semantically different. We demonstrate how ETS can achieve 1.8$\times$ reduction in average KV cache size during the search process, leading to 1.4$\times$ increased throughput relative to prior state-of-the-art methods, with minimal accuracy degradation and without requiring any custom kernel implementation. Code is available at: https://github.com/SqueezeAILab/ETS.
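Illustrative sketch (not the authors' implementation): the abstract describes a cost model that penalizes the number of tree nodes retained, as a proxy for KV cache size, while rewarding semantic coverage. The toy Python example below shows that trade-off on a four-leaf tree. It replaces the linear programming formulation with brute-force subset enumeration, and every name and value in it (`select_trajectories`, `lambda_nodes`, `lambda_cover`, the rewards, the cluster labels) is a hypothetical choice made for illustration. See the linked repository for the actual method.

```python
from itertools import combinations


def retained_nodes(paths):
    """Union of tree nodes on the given root-to-leaf paths.

    A prefix shared by several paths is counted once, which is how
    KV sharing shows up in this toy model.
    """
    nodes = set()
    for path in paths:
        nodes.update(path)
    return nodes


def select_trajectories(leaves, lambda_nodes=1.0, lambda_cover=1.0):
    """Return (indices, score) of the best non-empty subset of leaves.

    Assumed objective (hypothetical, for illustration only):
        sum of rewards
        - lambda_nodes * (fraction of tree nodes retained)
        + lambda_cover * (fraction of semantic clusters covered)

    Brute force over all subsets; only sensible for tiny trees.
    """
    all_nodes = retained_nodes(leaf["path"] for leaf in leaves)
    all_clusters = {leaf["cluster"] for leaf in leaves}
    best_score, best_subset = float("-inf"), ()
    for k in range(1, len(leaves) + 1):
        for subset in combinations(range(len(leaves)), k):
            chosen = [leaves[i] for i in subset]
            reward = sum(leaf["reward"] for leaf in chosen)
            node_frac = len(retained_nodes(leaf["path"] for leaf in chosen)) / len(all_nodes)
            cover_frac = len({leaf["cluster"] for leaf in chosen}) / len(all_clusters)
            score = reward - lambda_nodes * node_frac + lambda_cover * cover_frac
            if score > best_score:
                best_score, best_subset = score, subset
    return list(best_subset), best_score


# Toy tree: node 0 is the root; leaves 0 and 1 share the prefix (0, 1)
# and fall in the same (made-up) semantic cluster, so leaf 1 is redundant.
leaves = [
    {"path": (0, 1, 3), "reward": 0.30, "cluster": 0},
    {"path": (0, 1, 4), "reward": 0.20, "cluster": 0},
    {"path": (0, 2, 5), "reward": 0.25, "cluster": 1},
    {"path": (0, 2, 6), "reward": 0.05, "cluster": 1},
]

selected, score = select_trajectories(leaves, lambda_nodes=2.0, lambda_cover=1.0)
print(selected)  # [0, 2]
```

With the node penalty active, the sketch keeps one trajectory per semantic cluster and drops the redundant sibling. Setting `lambda_nodes=0.0` removes the pressure to share KV cache, and all four leaves are retained.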

Related papers

Q-VLM: Post-training Quantization for Large Vision-Language Models [73.19871905102545]
We propose a post-training quantization framework for large vision-language models (LVLMs) for efficient multi-modal inference. We mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into the optimal quantization strategy. Experimental results demonstrate that our method compresses memory by 2.78x and increases generation speed by 1.44x on the 13B LLaVA model without performance degradation.
arXiv Detail & Related papers (2024-10-10T17:02:48Z)
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
MGAS: Multi-Granularity Architecture Search for Trade-Off Between Model Effectiveness and Efficiency [10.641875933652647]
We introduce multi-granularity architecture search (MGAS) to discover both effective and efficient neural networks. We learn discretization functions specific to each granularity level to adaptively determine the unit remaining ratio according to the evolving architecture. Extensive experiments on CIFAR-10, CIFAR-100 and ImageNet demonstrate that MGAS outperforms other state-of-the-art methods in achieving a better trade-off between model performance and model size.
arXiv Detail & Related papers (2023-10-23T16:32:18Z)
Single-Stage Visual Relationship Learning using Conditional Queries [60.90880759475021]
TraCQ is a new formulation for scene graph generation that avoids the multi-task learning problem and the entity pair distribution. We employ a DETR-based encoder-decoder with conditional queries to significantly reduce the entity label space as well. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods, but also beats many state-of-the-art two-stage methods on the Visual Genome dataset.
arXiv Detail & Related papers (2023-06-09T06:02:01Z)
Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). We instead directly sample the Q function from its posterior distribution by using Langevin Monte Carlo. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
Deep Forest with Hashing Screening and Window Screening [25.745779145969053]
We introduce a hashing screening mechanism for multi-grained scanning of gcForest. We propose a model called HW-Forest which adopts two strategies, hashing screening and window screening. Our experimental results show that HW-Forest has higher accuracy than other models, and the time cost is also reduced.
arXiv Detail & Related papers (2022-07-25T07:39:55Z)
Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks. We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs. Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
Fast Line Search for Multi-Task Learning [0.0]
We propose a novel idea for line search algorithms in multi-task learning. The idea is to use the latent representation space instead of the parameter space for finding the step size. We compare this idea with classical backtracking and gradient methods with a constant learning rate on MNIST, CIFAR-10, and Cityscapes tasks.
arXiv Detail & Related papers (2021-10-02T21:02:29Z)
Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models. Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z)
Stagnation Detection in Highly Multimodal Fitness Landscapes [0.0]
Stagnation detection has been proposed as a mechanism for randomized search heuristics to escape from local optima. In this paper, we investigate a new mechanism called radius memory, which can be added to stagnation detection to control the search radius more carefully. We implement this idea in an algorithm called SD-RLS$^{\text{m}}$ and show that it yields speed-ups compared to previous variants of stagnation detection.
arXiv Detail & Related papers (2021-04-09T14:33:52Z)
Effective and Fast: A Novel Sequential Single Path Search for Mixed-Precision Quantization [45.22093693422085]
A mixed-precision quantization model can assign different quantization bit-precisions according to the sensitivity of different layers to achieve great performance. It is a difficult problem to quickly determine the quantization bit-precision of each layer in deep neural networks under given constraints. We propose a novel sequential single path search (SSPS) method for mixed-precision quantization.
arXiv Detail & Related papers (2021-03-04T09:15:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.