Sleep-time Compute: Beyond Inference Scaling at Test-time
- URL: http://arxiv.org/abs/2504.13171v1
- Date: Thu, 17 Apr 2025 17:59:25 GMT
- Title: Sleep-time Compute: Beyond Inference Scaling at Test-time
- Authors: Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez
- Abstract summary: Scaling test-time compute has emerged as a key ingredient for enabling large language models to solve difficult problems. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented.
- Score: 52.87976180878445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~5x on Stateful GSM-Symbolic and Stateful AIME, and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case study of applying sleep-time compute to a realistic agentic SWE task.
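The mechanism the abstract describes splits inference into an offline pass over the context and a cheap online pass per query. The sketch below is a minimal illustration of that split, assuming a generic `complete` LLM call; the prompts and function names are our own illustration, not the authors' implementation.

```python
# Minimal sketch of sleep-time compute (illustrative, not the paper's code).
# Assumes a generic `complete(prompt) -> str` LLM call; swap in any client.

def complete(prompt: str, max_tokens: int = 512) -> str:
    """Placeholder for an LLM call (any chat/completions client works)."""
    raise NotImplementedError

def sleep_time_compute(context: str) -> str:
    """Offline phase: anticipate likely queries and pre-compute useful
    inferences about the context before any user query arrives."""
    prompt = (
        "You will later answer questions about the context below.\n"
        "Anticipate likely questions and derive intermediate facts and\n"
        "quantities that would help answer them quickly.\n\n"
        f"Context:\n{context}\n\nNotes:"
    )
    return complete(prompt)

def answer_query(context: str, notes: str, query: str) -> str:
    """Test-time phase: answer with the pre-computed notes in context,
    so much less fresh reasoning (test-time compute) is needed."""
    prompt = (
        f"Context:\n{context}\n\n"
        f"Pre-computed notes about this context:\n{notes}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return complete(prompt, max_tokens=128)

# Amortization (cf. Multi-Query GSM-Symbolic): one offline pass over the
# context serves many related queries, lowering the average cost per query.
# notes = sleep_time_compute(context)
# for q in queries:
#     print(answer_query(context, notes, q))
```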
Related papers
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
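The MCMC angle can be made concrete with a small sketch: a generic independence Metropolis-Hastings loop over whole responses, targeting a reward-tilted distribution. The `sample_response` and `reward` functions are assumed placeholders, and this is an illustration of the general technique rather than QAlign itself.

```python
import math
import random

def sample_response(prompt: str) -> str:
    """Placeholder: draw a full response from the base LLM."""
    raise NotImplementedError

def reward(prompt: str, response: str) -> float:
    """Placeholder: score a response (e.g., with a reward model)."""
    raise NotImplementedError

def mh_align(prompt: str, steps: int = 50, beta: float = 1.0) -> str:
    """Independence Metropolis-Hastings over whole responses.

    With proposals drawn from the base model p(y) and target
    pi(y) proportional to p(y) * exp(reward(y) / beta), the proposal
    terms cancel: acceptance depends only on the reward difference,
    so no logits are needed -- matching the blurb's claim."""
    current = sample_response(prompt)
    r_cur = reward(prompt, current)
    for _ in range(steps):
        proposal = sample_response(prompt)
        r_prop = reward(prompt, proposal)
        # Accept with probability min(1, exp((r_prop - r_cur) / beta)).
        if math.log(random.random()) < (r_prop - r_cur) / beta:
            current, r_cur = proposal, r_prop
    return current
```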
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
- Time-MQA: Time Series Multi-Task Question Answering with Context Enhancement [55.2439260314328]
Time Series Multi-Task Question Answering (Time-MQA) is a unified framework that enables natural language queries across multiple time series tasks. Central to Time-MQA is the TSQA dataset, a large-scale dataset containing ~200k question-answer pairs.
arXiv Detail & Related papers (2025-02-26T13:47:13Z)
- On Aggregation Queries over Predicted Nearest Neighbors [33.06696811081107]
We introduce AQNNs, a novel type of aggregation query over the predicted neighborhood of a designated object. AQNNs are prevalent in modern applications where, for instance, a medical professional may want to compute "the average systolic blood pressure of patients whose predicted condition is similar to a given insomnia patient." Since prediction typically involves an expensive deep learning model or a human expert, we formulate query processing as the problem of returning an approximate aggregate.
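As a rough illustration of the query shape (not the paper's algorithm), the sketch below approximates the aggregate by running the expensive predictor on a random sample of records instead of the full table; the data layout and `expensive_predict` are assumptions.

```python
import random
from statistics import mean

def expensive_predict(record: dict) -> str:
    """Placeholder for the costly predictor (deep model or human expert)
    mapping a patient record to a predicted condition."""
    raise NotImplementedError

def approx_aqnn_average(patients: list[dict], target_condition: str,
                        value_key: str = "systolic_bp",
                        sample_size: int = 200) -> float:
    """Approximate 'average value over predicted neighbors' by predicting
    on a random sample, trading accuracy for far fewer predictor calls."""
    sample = random.sample(patients, min(sample_size, len(patients)))
    matched = [p[value_key] for p in sample
               if expensive_predict(p) == target_condition]
    if not matched:
        raise ValueError("no sampled patient matched the condition")
    return mean(matched)
```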
arXiv Detail & Related papers (2025-02-26T04:17:32Z)
- s1: Simple test-time scaling [148.4204982041058]
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. We seek the simplest approach to achieve test-time scaling and strong reasoning performance.
arXiv Detail & Related papers (2025-01-31T18:48:08Z)
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
- MorpheusNet: Resource efficient sleep stage classifier for embedded on-line systems [0.0]
Sleep Stage Classification (SSC) is a labor-intensive task, requiring experts to examine hours of electrophysiological recordings for manual classification.
With increasing affordability and expansion of wearable devices, SSC may enable deployment of sleep-based therapies at scale.
Deep Learning has gained increasing attention as a potential method to automate this process.
arXiv Detail & Related papers (2024-01-14T17:52:08Z)
- Annotating sleep states in children from wrist-worn accelerometer data using Machine Learning [4.506099292980221]
We propose to model the accelerometer data using different machine learning (ML) techniques such as support vector machines, boosting, ensemble methods, and more complex approaches involving LSTMs and region-based CNNs.
We then evaluate these approaches using the Event Detection Average Precision (EDAP) score (similar to the IoU metric) to compare their predictive power and model performance.
arXiv Detail & Related papers (2023-12-09T09:10:39Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which in contrast optimizes task-specific decoder networks on the output side.
By gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a 200x speed-up.
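A generic sketch of output-side decoder tuning follows: cache one frozen-PLM forward pass per example, then gradient-train a small head on the cached features. The feature extractor and the toy linear head are assumptions for illustration, not the paper's exact architecture.

```python
import torch
from torch import nn

def plm_features(texts: list[str]) -> torch.Tensor:
    """Placeholder: one frozen PLM forward pass per text, returning a
    fixed-size feature vector each (the only PLM cost per sample)."""
    raise NotImplementedError

def tune_decoder(texts, labels, num_classes, dim=768, epochs=100):
    feats = plm_features(texts)            # queried once, then reused
    y = torch.tensor(labels)
    head = nn.Linear(dim, num_classes)     # tiny task-specific decoder
    opt = torch.optim.Adam(head.parameters(), lr=1e-2)
    for _ in range(epochs):                # fast: only the head trains
        opt.zero_grad()
        loss = nn.functional.cross_entropy(head(feats), y)
        loss.backward()
        opt.step()
    return head
```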
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- SOMTimeS: Self Organizing Maps for Time Series Clustering and its Application to Serious Illness Conversations [3.2689702143620147]
We present a new DTW-based clustering method called SOMTimeS (a Self-Organizing Map for TIME Series).
It scales better and runs faster than other DTW-based clustering algorithms, with comparable accuracy.
We applied SOMTimeS to natural language conversation data collected as part of a large healthcare cohort study.
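For readers unfamiliar with the distance at the core of such methods, here is a standard dynamic-programming DTW implementation; this is textbook DTW, not the accelerated variant the paper contributes.

```python
def dtw(a: list[float], b: list[float]) -> float:
    """Classic O(len(a)*len(b)) dynamic time warping distance: the
    alignment cost that DTW-based clustering compares sequences with."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Example: identical shapes at different speeds align with zero cost.
# dtw([0, 1, 2, 3], [0, 0, 1, 1, 2, 2, 3, 3]) -> 0.0
```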
arXiv Detail & Related papers (2021-08-26T00:18:25Z)
- MSED: a multi-modal sleep event detection model for clinical sleep analysis [62.997667081978825]
We designed a single deep neural network architecture to jointly detect sleep events in a polysomnogram.
The performance of the model was quantified by F1, precision, and recall scores, and by correlating index values to clinical values.
arXiv Detail & Related papers (2021-01-07T13:08:44Z)