Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving
- URL: http://arxiv.org/abs/2504.17999v1
- Date: Fri, 25 Apr 2025 00:58:37 GMT
- Title: Streaming, Fast and Slow: Cognitive Load-Aware Streaming for Efficient LLM Serving
- Authors: Chang Xiao, Brenda Yang
- Abstract summary: Streaming content faster than users can read appears unnecessary, resulting in wasted computational resources and potential delays for other users. We propose an adaptive streaming method that dynamically adjusts the pacing of LLM streaming output in real time based on inferred cognitive load. Our approach estimates the cognitive load associated with streaming content and strategically slows down the stream during complex or information-rich segments, thereby freeing computational resources for other users.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generative conversational interfaces powered by large language models (LLMs) typically stream output token-by-token at a rate determined by computational budget, often neglecting actual human reading speeds and the cognitive load associated with the content. This mismatch frequently leads to inefficient use of computational resources. For example, in cloud-based services, streaming content faster than users can read appears unnecessary, resulting in wasted computational resources and potential delays for other users, particularly during peak usage periods. To address this issue, we propose an adaptive streaming method that dynamically adjusts the pacing of LLM streaming output in real time based on inferred cognitive load. Our approach estimates the cognitive load associated with streaming content and strategically slows down the stream during complex or information-rich segments, thereby freeing computational resources for other users. Our statistical analysis of computational savings, combined with crowdsourced user studies, provides insights into the trade-offs between service efficiency and user satisfaction, demonstrating that our method can reduce computational consumption by up to 16.8%. This context-aware computational resource management strategy presents a practical framework for enhancing system efficiency in cloud-based conversational AI interfaces without compromising user experience.
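The abstract does not specify the load estimator, so the following is a minimal Python sketch of the pacing idea only, assuming a crude per-chunk proxy for cognitive load (word length and word rarity); `estimate_load`, its weights, and the delay bounds are hypothetical stand-ins, not the authors' implementation.

```python
import time

# Hypothetical proxy: longer and rarer words are assumed to carry more
# cognitive load. This stands in for the paper's (unspecified) estimator.
COMMON_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

def estimate_load(chunk: str) -> float:
    words = chunk.split()
    if not words:
        return 0.0
    avg_len = sum(len(w) for w in words) / len(words)
    rare_frac = sum(w.lower() not in COMMON_WORDS for w in words) / len(words)
    return 0.5 * min(avg_len / 10.0, 1.0) + 0.5 * rare_frac  # roughly in [0, 1]

def stream_with_pacing(chunks, base_delay=0.05, max_extra=0.20):
    """Emit chunks, slowing down on information-rich segments so the
    stream never outpaces the reader; the freed decode slots can then
    be scheduled to other requests."""
    for chunk in chunks:
        print(chunk, end=" ", flush=True)
        time.sleep(base_delay + max_extra * estimate_load(chunk))

stream_with_pacing("The eigendecomposition of a symmetric matrix yields orthogonal eigenvectors".split())
```

In a production serving stack the added delay would be realized by the scheduler deprioritizing the request's decode steps rather than by sleeping in the handler, but the control signal is the same.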
Related papers
- StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition [19.54521322177521]
We introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100). We propose a novel perception-cognition intertemporal paradigm named 'event-gated LLM invocation'. Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency.
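As a rough illustration of what 'event-gated' invocation means, here is a hedged Python sketch: a cheap per-frame gate decides when to call the expensive LLM instead of invoking it on every frame. The change detector and threshold are hypothetical stand-ins, not StreamMind's perception module.

```python
# Sketch: invoke the LLM only when a cheap gate detects an "event".
def frame_diff(prev, frame):
    # any inexpensive change score works; here, mean absolute pixel diff
    return sum(abs(a - b) for a, b in zip(prev, frame)) / len(frame)

def stream_dialogue(frames, invoke_llm, threshold=0.3):
    prev = frames[0]
    for frame in frames[1:]:
        if frame_diff(prev, frame) > threshold:  # event fired: pay for cognition
            yield invoke_llm(frame)
        prev = frame                             # otherwise, perception only
```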
arXiv Detail & Related papers (2025-03-08T13:44:38Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the video LLM.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Efficiently Serving LLM Reasoning Programs with Certaindex [4.681117143870077]
Dynasor is a system that optimizes inference-time compute for large language model (LLM) reasoning queries. Unlike traditional engines, Dynasor tracks and schedules requests within reasoning queries. It reduces compute by up to 50% in batch processing while sustaining 3.3x higher query rates or 4.7x tighter latency SLOs in online serving.
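The summary only hints at the mechanism, so here is a hedged sketch of certainty-gated reasoning in Python: decode in chunks and stop once a confidence probe says the answer has stabilized, freeing compute for other queries. The `certainty` probe and threshold are illustrative stand-ins for the paper's certaindex statistic, not its actual implementation.

```python
# Sketch: stop spending decode steps on a reasoning query once an
# (assumed) certainty probe crosses a threshold.
def solve_with_budget(generate_step, certainty, max_steps=64, tau=0.9):
    trace = []
    for _ in range(max_steps):
        trace.append(generate_step(trace))  # produce one more reasoning chunk
        if certainty(trace) >= tau:         # confident enough: terminate early
            break
    return trace
```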
arXiv Detail & Related papers (2024-12-30T14:57:53Z)
- Enabling Real-Time Conversations with Minimal Training Costs [61.80370154101649]
This paper presents a new duplex decoding approach that enhances large language models with duplex ability, requiring minimal training.
Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
arXiv Detail & Related papers (2024-09-18T06:27:26Z)
- Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding [61.45448947483328]
We introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER).
LASER features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens.
LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test.
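For readers unfamiliar with the underlying technique, the sketch below shows a generic retrieval-based draft-and-verify loop in Python. The n-gram pool is a toy stand-in for LASER's Customized Retrieval Pool, and the exact-match check is stricter than its Relaxed Verification; a real system also verifies all drafts in a single batched forward pass rather than one call per token.

```python
# Generic retrieval-based speculative decoding (illustrative only).
def draft_from_pool(pool, context, k=4):
    # pool maps a short context suffix to a likely continuation
    return list(pool.get(tuple(context[-2:]), ()))[:k]

def speculative_decode(target_next_token, pool, context, steps=32):
    out = list(context)
    for _ in range(steps):
        for tok in draft_from_pool(pool, out):
            if target_next_token(out) == tok:  # draft token accepted for free
                out.append(tok)
            else:
                break                          # first mismatch ends the run
        out.append(target_next_token(out))     # always take one verified step
    return out
```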
arXiv Detail & Related papers (2024-08-11T02:31:13Z)
- Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance.
Our method incurs less cost during inference while maintaining the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z)
- MemFlow: Optical Flow Estimation and Prediction with Memory [54.22820729477756]
We present MemFlow, a real-time method for optical flow estimation and prediction with memory.
Our method employs memory read-out and update modules to aggregate historical motion information in real time.
Our approach seamlessly extends to the future prediction of optical flow based on past observations.
arXiv Detail & Related papers (2024-04-07T04:56:58Z)
- Breaking the Length Barrier: LLM-Enhanced CTR Prediction in Long Textual User Behaviors [25.086118164540974]
Large language models (LLMs) are used to improve the performance of click-through rate (CTR) prediction.
As user sequences grow longer, the current efficiency of LLMs is inadequate for training on billions of users and items.
We propose Behavior Aggregated Hierarchical Encoding (BAHE) to enhance the efficiency of LLM-based CTR modeling.
arXiv Detail & Related papers (2024-03-28T12:05:15Z)
- Learn to Compress (LtC): Efficient Learning-based Streaming Video Analytics [3.2872586139884623]
LtC is a collaborative framework between the video source and the analytics server that efficiently learns to reduce the video streams within an analytics pipeline.
LtC uses 28-35% less bandwidth and has up to 45% shorter response delay compared to recently published state-of-the-art streaming frameworks.
arXiv Detail & Related papers (2023-07-22T21:36:03Z)
- Fast Context Adaptation in Cost-Aware Continual Learning [10.515324071327903]
5G and Beyond networks require more complex learning agents and the learning process itself might end up competing with users for communication and computational resources.
This creates friction: on the one hand, the learning process needs resources to converge quickly to an effective strategy; on the other hand, it needs to be efficient, i.e., take as few resources as possible from the user's data plane, so as not to throttle users' resources.
In this paper, we propose a dynamic strategy to balance the resources assigned to the data plane and those reserved for learning.
arXiv Detail & Related papers (2023-06-06T17:46:48Z)
- Learnability with Time-Sharing Computational Resource Concerns [65.268245109828]
We present a theoretical framework that takes into account the influence of computational resources in learning theory.
This framework can be naturally applied to stream learning where the incoming data streams can be potentially endless.
It may also provide a theoretical perspective for the design of intelligent supercomputing operating systems.
arXiv Detail & Related papers (2023-05-03T15:54:23Z)
- Dynamic Scheduling for Federated Edge Learning with Streaming Data [56.91063444859008]
We consider a Federated Edge Learning (FEEL) system where training data are randomly generated over time at a set of distributed edge devices with long-term energy constraints.
Due to limited communication resources and latency requirements, only a subset of devices is scheduled to participate in local training in each iteration.
arXiv Detail & Related papers (2023-05-02T07:41:16Z)
- Lightweight Event-based Optical Flow Estimation via Iterative Deblurring [22.949700247611695]
We introduce IDNet, a lightweight yet high-performing event-based optical flow network directly estimating flow from event traces without using correlation volumes.
Our top-performing ID model sets a new state of the art on the DSEC benchmark.
Our base ID model is competitive with prior art while using 80% fewer parameters, consuming a 20x smaller memory footprint, and running 40% faster on the NVIDIA Jetson Xavier NX.
arXiv Detail & Related papers (2022-11-24T17:26:27Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample as soon as it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy as a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- Asynchronous Parallel Incremental Block-Coordinate Descent for Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis of massive Internet of Things (IoT)-based intelligent and ubiquitous computing.
As applications and data volumes grow rapidly, distributed learning is a promising emerging paradigm, since it is often impractical or inefficient to share or aggregate data.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z)
- Optimal Resource Allocation for Serverless Queries [8.59568779761598]
Prior work focused on predicting peak allocation while ignoring aggressive trade-offs between resource allocation and run-time.
We introduce a system for optimal resource allocation that can predict performance with aggressive trade-offs, for both new and past observed queries.
arXiv Detail & Related papers (2021-07-19T02:55:48Z)
- Faster than LASER -- Towards Stream Reasoning with Deep Neural Networks [0.6649973446180738]
Stream reasoners aim to bridge the gap between reasoning and stream processing.
LASER is a stream reasoner designed to analyse and perform complex reasoning over streams of data.
We study whether Convolutional and Recurrent Neural Networks, which have been shown to be particularly well suited for time-series forecasting and classification, can be trained to approximate reasoning with LASER.
arXiv Detail & Related papers (2021-06-15T22:06:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.