LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding
- URL: http://arxiv.org/abs/2505.16983v2
- Date: Thu, 29 May 2025 13:22:18 GMT
- Title: LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding
- Authors: Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen
- Abstract summary: It is commonly assumed that the output-attention and position-ID mismatches require frequent re-encoding, but the analysis shows that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. We introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes.
- Score: 29.586274567275012
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or on specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at https://github.com/EIT-NLP/StreamingLLM.
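The idea of keeping relative positions within the source and target contexts, rather than one absolute order over the interleaved stream, can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); the function names, the `"src"`/`"tgt"` labels, and the `target_offset` parameter are hypothetical and chosen only for illustration. The sketch assumes each token is tagged with its group and gives every group its own position counter.

```python
from typing import Dict, List, Sequence


def group_position_ids(groups: Sequence[str], target_offset: int = 0) -> List[int]:
    """Assign a position ID to each token using one counter per group.

    groups[i] is "src" or "tgt" for the i-th token in arrival order.
    target_offset is a hypothetical starting index for the target group.
    """
    counters: Dict[str, int] = {"src": 0, "tgt": target_offset}
    ids: List[int] = []
    for group in groups:
        if group not in counters:
            raise ValueError(f"unknown group label: {group!r}")
        ids.append(counters[group])
        counters[group] += 1
    return ids


def ids_by_group(groups: Sequence[str], ids: Sequence[int]) -> Dict[str, List[int]]:
    """Collect the position IDs seen by each group, in arrival order."""
    out: Dict[str, List[int]] = {"src": [], "tgt": []}
    for group, pos in zip(groups, ids):
        out[group].append(pos)
    return out


# Streaming order interleaves source chunks with generated target tokens.
streaming = ["src", "src", "tgt", "src", "tgt", "tgt"]
# Batch order presents the full source first, then the full target.
batch = ["src", "src", "src", "tgt", "tgt", "tgt"]

streaming_ids = group_position_ids(streaming)
batch_ids = group_position_ids(batch)
```

Under these assumptions, every source token and every target token receives the same position ID whether the sequence arrives interleaved or as a batch, whereas a single running index over the interleaved stream would shift positions each time a new source chunk arrives.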
Related papers
- InferLog: Accelerating LLM Inference for Online Log Parsing via ICL-oriented Prefix Caching [23.53991589506265]
We present InferLog, the first inference optimization method for online log parsing. InferLog accelerates inference by designing (1) A Prefix-aware ICL Refinement policy to refine the examples and permutation of in-context learning to improve the prefix caching efficiency.
(2025-07-11T12:21:29Z)
- Flipping Knowledge Distillation: Leveraging Small Models' Expertise to Enhance LLMs in Text Matching [16.725632407644884]
We introduce a flipped knowledge distillation paradigm, where a Large Language Model learns from a Smaller Language Model. Specifically, we address the architectural gap between decoder-only LLMs and smaller encoder-based models. Experiments on financial and healthcare benchmarks, as well as real-world applications, confirm its effectiveness.
(2025-07-08T02:54:15Z)
- DecoRTL: A Run-time Decoding Framework for RTL Code Generation with LLMs [0.0]
We show that large language models (LLMs) exhibit low confidence in regions of structural ambiguity or semantic complexity. We introduce DecoRTL, a novel run-time decoding strategy that is both syntax-aware and contrastive for RTL code generation. Our approach operates entirely at inference time without requiring any additional model fine-tuning.
(2025-07-03T01:17:44Z)
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
(2025-05-28T17:39:15Z)
- ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning [68.76048244253582]
We introduce ViaRL, the first framework to leverage rule-based reinforcement learning (RL) for optimizing frame selection in video understanding. ViaRL utilizes the answer accuracy of a downstream model as a reward signal to train a frame selector through trial-and-error. ViaRL consistently delivers superior temporal grounding performance and robust generalization across diverse video understanding tasks.
(2025-05-21T12:29:40Z)
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
(2024-06-27T22:20:39Z)
- StreamBench: Towards Benchmarking Continuous Improvement of Language Agents [63.54557575233165]
Large language model (LLM) agents are able to improve themselves from experience, which is an important ability for continuous enhancement post-deployment.
We introduce StreamBench, a benchmark designed to evaluate the continuous improvement of LLM agents over an input-feedback sequence.
Our work serves as a stepping stone towards developing effective online learning strategies for LLMs, paving the way for more adaptive AI systems in streaming scenarios.
(2024-06-13T02:08:28Z)
- CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning [59.88924847995279]
We propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for MTSF. To reduce the distribution discrepancy, we develop the cross-modal match module. CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks.
(2024-03-12T04:04:38Z)
- CLLMs: Consistency Large Language Models [18.17892007267927]
Jacobi decoding achieves little speedup compared to traditional autoregressive (AR) decoding.
We develop a new approach aimed at realizing fast convergence from any state to the fixed point on a Jacobi trajectory.
(2024-02-28T20:17:04Z)
- LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models [83.98062659664785]
Large language models (LLMs) typically train on short text segments (e.g., 4K tokens) due to the quadratic complexity of their Transformer architectures.
This work identifies three major factors contributing to this length generalization failure.
We propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts.
(2023-08-30T16:47:51Z)
- A Coding Framework and Benchmark towards Low-Bitrate Video Understanding [63.05385140193666]
We propose a traditional-neural mixed coding framework that takes advantage of both traditional codecs and neural networks (NNs).
The framework is optimized by ensuring that a transportation-efficient semantic representation of the video is preserved.
We build a low-bitrate video understanding benchmark with three downstream tasks on eight datasets, demonstrating the notable superiority of our approach.
(2022-02-06T16:29:15Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
(2021-05-13T12:54:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.