XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
- URL: http://arxiv.org/abs/2602.21780v1
- Date: Wed, 25 Feb 2026 11:02:02 GMT
- Title: XStreamVGGT: Extremely Memory-Efficient Streaming Vision Geometry Grounded Transformer with KV Cache Compression
- Authors: Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, Ngai Wong
- Abstract summary: XStreamVGGT is a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache. XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$.
- Score: 20.18561757219652
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning-based 3D visual geometry models have significantly advanced with the advent of large-scale transformers. Among these, StreamVGGT leverages frame-wise causal attention to deliver robust and efficient streaming 3D reconstruction. However, it suffers from unbounded growth in the Key-Value (KV) cache due to the massive influx of vision tokens from multi-image and long-video inputs, leading to increased memory consumption and inference latency as input frames accumulate. This ultimately limits its scalability for long-horizon applications. To address this gap, we propose XStreamVGGT, a tuning-free approach that seamlessly integrates pruning and quantization to systematically compress the KV cache, enabling extremely memory-efficient streaming inference. Specifically, redundant KVs generated from multi-frame inputs are initially pruned to conform to a fixed KV memory budget using an efficient token-importance identification mechanism that maintains full compatibility with high-performance attention kernels (e.g., FlashAttention). Additionally, leveraging the inherent distribution patterns of KV tensors, we apply dimension-adaptive KV quantization within the pruning pipeline to further minimize memory overhead while preserving numerical accuracy. Extensive evaluations show that XStreamVGGT achieves mostly negligible performance degradation while substantially reducing memory usage by 4.42$\times$ and accelerating inference by 5.48$\times$, enabling practical and scalable streaming 3D applications. The code is available at https://github.com/ywh187/XStreamVGGT/.
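As a rough illustration of the prune-then-quantize pipeline the abstract describes, here is a minimal sketch. The key-norm importance score, the int8 per-channel scheme, and all names (`compress_kv`, `kv_budget`, etc.) are illustrative assumptions, not the paper's exact method; in particular the paper's token-importance mechanism is only stated to avoid materializing attention maps, which a key-side score like this one would satisfy.

```python
# Minimal sketch of a prune-then-quantize KV cache pipeline in the spirit of
# XStreamVGGT's description. Importance scoring and the quantization scheme
# below are illustrative stand-ins for the paper's actual design.
import torch

def compress_kv(keys, values, kv_budget):
    """keys, values: [num_tokens, head_dim]; keep `kv_budget` tokens, quantize per channel."""
    # 1) Token pruning: score tokens without materializing attention maps, so
    #    the retained cache remains compatible with fused kernels (FlashAttention).
    scores = keys.norm(dim=-1)                                   # [num_tokens], stand-in score
    keep = torch.topk(scores, k=min(kv_budget, keys.shape[0])).indices.sort().values
    k_kept, v_kept = keys[keep], values[keep]

    # 2) Dimension-adaptive quantization: one scale per channel (head_dim axis),
    #    reflecting that KV value ranges differ strongly across channels.
    def quant(x):
        scale = x.abs().amax(dim=0).clamp(min=1e-8) / 127.0      # [head_dim]
        return torch.round(x / scale).to(torch.int8), scale

    (k_q, k_s), (v_q, v_s) = quant(k_kept), quant(v_kept)
    return (k_q, k_s), (v_q, v_s), keep

def dequant(q, scale):
    # Broadcast the per-channel scale back over the retained tokens.
    return q.float() * scale
```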
Related papers
- DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity [50.52392445266824]
We propose a residual-based KV cache compression framework motivated by long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME.
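A minimal sketch of the residual encoding idea in the summary above: each new key/value vector is stored as a delta against a retrieved historical reference rather than being dropped. The cosine-similarity retrieval is an assumption for illustration; the paper's retrieval and residual coding are not specified here.

```python
# Sketch of residual KV encoding in the spirit of DeltaKV's summary.
import torch
import torch.nn.functional as F

def encode_residual(new_kv, references):
    """new_kv: [d]; references: [num_refs, d] of previously stored KV vectors."""
    sims = F.cosine_similarity(references, new_kv.unsqueeze(0))  # [num_refs]
    ref_id = int(sims.argmax())
    residual = new_kv - references[ref_id]   # small if long-range similarity holds
    return ref_id, residual                  # the residual compresses better than new_kv

def decode_residual(ref_id, residual, references):
    # Exact reconstruction (before any further lossy coding of the residual).
    return references[ref_id] + residual
```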
arXiv Detail & Related papers (2026-02-08T15:14:36Z)
- Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization [83.406036390582]
Quant VideoGen (QVG) is a training-free KV cache quantization framework for autoregressive video diffusion models. It reduces KV memory by up to 7.0$\times$ with less than 4% end-to-end latency overhead. It consistently outperforms existing baselines in generation quality.
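For context, a generic group-wise 2-bit quantizer of the kind such aggressive KV compression relies on; the group size and min-max rounding scheme are illustrative assumptions, not QVG's actual design.

```python
# Generic group-wise 2-bit quantization sketch (4 levels per group).
import torch

def quantize_2bit(x, group=64):
    """x: tensor whose element count is divisible by `group`."""
    g = x.reshape(-1, group)
    lo = g.amin(dim=1, keepdim=True)
    scale = (g.amax(dim=1, keepdim=True) - lo).clamp(min=1e-8) / 3.0  # 2 bits -> 4 levels
    q = torch.round((g - lo) / scale).to(torch.uint8)                 # codes in [0, 3]
    return q, scale, lo

def dequantize_2bit(q, scale, lo, shape):
    return (q.float() * scale + lo).reshape(shape)
```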
arXiv Detail & Related papers (2026-02-03T00:54:32Z)
- PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache [61.57938553036056]
We introduce PackCache, a training-free KV-cache management method that compacts the KV cache through three coordinated mechanisms. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2$\times$ on 48-frame long sequences.
arXiv Detail & Related papers (2026-01-07T19:51:06Z)
- PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression [8.427136461713706]
We present PackKV, a generic and efficient KV cache management framework. PackKV supports both latency-critical and throughput-critical inference scenarios.
arXiv Detail & Related papers (2025-12-30T20:05:32Z)
- KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache [7.019967158501771]
We present KVComp, a generic and efficient KV cache management framework optimized for long-text generation. KVComp employs novel lossy compression techniques specifically designed for KV cache data characteristics. We show that KVComp achieves on average 47% and up to 83% higher memory reduction rates compared to existing methods.
arXiv Detail & Related papers (2025-08-30T18:25:19Z)
- TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization [21.229296254354878]
The Key-Value cache in generative large language models (LLMs) introduces substantial memory overhead. Existing works mitigate this burden by offloading or compressing the KV cache. We propose a hybrid compression method, TailorKV, which seamlessly integrates quantization and offloading.
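A minimal sketch of the hybrid idea in this summary: some layers' KV caches are kept on-GPU in low precision while others are offloaded to CPU and fetched on demand. The per-layer split and the int8 scheme are illustrative assumptions, not TailorKV's actual policy.

```python
# Sketch of a hybrid quantize-or-offload KV cache store.
import torch

class HybridKVCache:
    def __init__(self, quantized_layers):
        self.quantized_layers = set(quantized_layers)  # layers kept on-GPU, quantized
        self.store = {}

    def put(self, layer, kv):
        if layer in self.quantized_layers:
            # Keep on the accelerator at low precision (single tensor-wide scale).
            scale = kv.abs().amax().clamp(min=1e-8) / 127.0
            self.store[layer] = ("quant", torch.round(kv / scale).to(torch.int8), scale)
        else:
            # Offload at full precision; fetched back only when the layer runs.
            self.store[layer] = ("cpu", kv.to("cpu", non_blocking=True), None)

    def get(self, layer, device="cuda"):
        kind, data, scale = self.store[layer]
        return data.float() * scale if kind == "quant" else data.to(device)
```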
arXiv Detail & Related papers (2025-05-26T07:00:04Z)
- DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction [33.936381781692994]
DiffKV is a novel framework for efficient KV cache compression. It exploits three levels of differentiation in the KV cache. It is able to compress the KV cache by 2.7$\times$ to 5.7$\times$ with near-lossless accuracy.
arXiv Detail & Related papers (2024-12-04T08:51:23Z)
- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing [58.29726147780976]
We propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression.
Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption.
We verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
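A minimal sketch of the layer-wise sharing idea: on calibration data, pick layers whose caches will be re-mapped to a donor layer's cache at inference (the title indicates KVSharer pairs *dissimilar* layers). The Euclidean ranking below is an illustrative stand-in for the paper's strategy.

```python
# Sketch of building a layer-to-donor KV sharing map from calibration caches.
import torch

def build_sharing_map(calib_caches, num_shared):
    """calib_caches: list of per-layer KV tensors [tokens, d]; returns {layer: donor}."""
    flat = torch.stack([c.flatten() for c in calib_caches])   # [num_layers, tokens*d]
    dist = torch.cdist(flat, flat)                            # pairwise cache distances
    mapping = {}
    # Re-map the `num_shared` layers whose caches stand out most overall.
    for layer in dist.sum(dim=1).argsort(descending=True)[:num_shared].tolist():
        donor = int(dist[layer].argmax())   # most dissimilar layer acts as donor
        mapping[layer] = donor              # `layer` reuses `donor`'s KV at inference
    return mapping
```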
arXiv Detail & Related papers (2024-10-24T08:06:41Z)
- ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
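A minimal sketch of query-driven key-channel pruning as the summary describes it: score each key-cache channel by its contribution to the attention logits and keep only the top channels. The exact scoring norm here is an illustrative assumption.

```python
# Sketch of query-dependent channel pruning for the key cache.
import torch

def prune_key_channels(queries, keys, keep_ratio=0.6):
    """queries: [q_len, d], keys: [k_len, d]; returns pruned keys + kept channel ids."""
    # Channel c contributes q[:, c] * k[:, c] to every attention logit, so score
    # it by the magnitude of that per-channel term across queries and keys.
    scores = queries.abs().sum(dim=0) * keys.abs().sum(dim=0)           # [d]
    kept = torch.topk(scores, k=int(keys.shape[1] * keep_ratio)).indices
    # Queries must be sliced with the same `kept` channels at attention time.
    return keys[:, kept], kept
```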
arXiv Detail & Related papers (2024-07-30T17:59:08Z)
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache [67.9776980972508]
We develop a tuning-free 2bit KV cache quantization algorithm named KIVI.
KIVI can enable Llama, Falcon, and Mistral models to maintain almost the same quality while using 2.6$\times$ less peak memory.
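A minimal sketch of KIVI's asymmetric scheme: keys are quantized per channel (their outliers cluster in fixed channels) and values per token. The 2-bit min-max rounding below is simplified for illustration and omits KIVI's grouping and residual full-precision window.

```python
# Sketch of asymmetric 2-bit KV quantization: per-channel keys, per-token values.
import torch

def kivi_style_quantize(keys, values):
    """keys, values: [tokens, d]; returns 2-bit codes plus scales/zero-points."""
    def quant_along(x, dim):
        lo = x.amin(dim=dim, keepdim=True)
        scale = (x.amax(dim=dim, keepdim=True) - lo).clamp(min=1e-8) / 3.0
        return torch.round((x - lo) / scale).to(torch.uint8), scale, lo

    k_pack = quant_along(keys, dim=0)     # stats over tokens  -> per-channel scales
    v_pack = quant_along(values, dim=1)   # stats over channels -> per-token scales
    return k_pack, v_pack

def kivi_style_dequant(pack):
    q, scale, lo = pack
    return q.float() * scale + lo
```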
arXiv Detail & Related papers (2024-02-05T06:06:47Z)