Lightweight Correlation-Aware Table Compression
- URL: http://arxiv.org/abs/2410.14066v3
- Date: Thu, 24 Oct 2024 13:28:18 GMT
- Title: Lightweight Correlation-Aware Table Compression
- Authors: Mihail Stoian, Alexander van Renen, Jan Kobiolka, Ping-Lin Kuo, Josif Grabocka, Andreas Kipf
- Abstract summary: $\texttt{Virtual}$ is a framework that integrates seamlessly with existing open formats.
Experiments on data-gov datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet.
- Score: 58.50312417249682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing adoption of data lakes for managing relational data necessitates efficient, open storage formats that provide high scan performance and competitive compression ratios. While existing formats achieve fast scans through lightweight encoding techniques, they have reached a plateau in terms of minimizing storage footprint. Recently, correlation-aware compression schemes have been shown to reduce file sizes further. Yet, current approaches either incur significant scan overheads or require manual specification of correlations, limiting their practicability. We present $\texttt{Virtual}$, a framework that integrates seamlessly with existing open formats to automatically leverage data correlations, achieving substantial compression gains while having minimal scan performance overhead. Experiments on data-gov datasets show that $\texttt{Virtual}$ reduces file sizes by up to 40% compared to Apache Parquet.
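The gist of correlation-aware compression is easy to see in miniature: when one column is (approximately) a function of another, a format can store the function plus small residuals instead of the raw values. The sketch below is a toy illustration with synthetic data, not $\texttt{Virtual}$'s actual detection or encoding algorithm.

```python
import zlib, json, random

# Two correlated integer columns: "shipped" is usually "ordered" plus a
# small offset, a pattern common in real tables.
random.seed(0)
ordered = [random.randint(0, 10**6) for _ in range(10_000)]
shipped = [o + random.randint(0, 3) for o in ordered]

def size(cols):
    # Crude stand-in for a columnar file: serialize, then DEFLATE-compress.
    return len(zlib.compress(json.dumps(cols).encode()))

baseline = size({"ordered": ordered, "shipped": shipped})

# Correlation-aware layout: keep "ordered" physically, but store "shipped"
# virtually as tiny residuals against "ordered"; a scan rematerializes it
# on the fly as ordered[i] + residual[i].
residuals = [s - o for s, o in zip(shipped, ordered)]
virtualized = size({"ordered": ordered, "shipped_residual": residuals})

print(f"baseline: {baseline} bytes, virtualized: {virtualized} bytes")
```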
Related papers
- Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation [49.48204107529758]
We define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query. In this paper, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
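As a toy illustration of the probing approach, the sketch below trains a lightweight linear probe on synthetic vectors standing in for compressed representations; the features, labels, and dimensions are all assumptions, not the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for compressed context representations: 512-dim vectors,
# with "overflow" cases drawn from a slightly shifted distribution.
X_ok = rng.normal(0.0, 1.0, size=(1000, 512))
X_overflow = rng.normal(0.3, 1.2, size=(1000, 512))
X = np.vstack([X_ok, X_overflow])
y = np.array([0] * 1000 + [1] * 1000)  # 1 = overflow

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A lightweight linear probe over the representations.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
print(f"probe AUC-ROC: {auc:.2f}")  # the paper reports ~0.72 on real data
```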
arXiv Detail & Related papers (2026-02-12T18:15:08Z) - Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
The Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model. ARFC is an auto-regressive model that performs compression via next-token prediction. A MoS module refines the compressed tokens by utilizing multiple compression results. ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z) - VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction [55.66673587952058]
Video understanding models are increasingly limited by the prohibitive storage and computational costs of large-scale datasets. VideoCompressa is a novel framework for video data synthesis that reframes the problem as dynamic latent compression.
arXiv Detail & Related papers (2025-11-24T07:07:58Z) - CoRECT: A Framework for Evaluating Embedding Compression Techniques at Scale [0.0]
CoRECT is a framework for large-scale evaluation of embedding compression methods. We show that non-learned compression achieves substantial index size reduction, even on up to 100M passages.
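One representative non-learned method such an evaluation can cover is scalar quantization; the sketch below applies int8 quantization to a toy index (the specific methods and data CoRECT benchmarks are not reproduced here).

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 768)).astype(np.float32)  # toy passage index

# Non-learned compression: per-dimension int8 scalar quantization,
# shrinking the index 4x relative to float32.
lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
scale = (hi - lo) / 255.0
codes = np.round((embeddings - lo) / scale).astype(np.uint8)

decoded = codes.astype(np.float32) * scale + lo
err = np.abs(decoded - embeddings).max()
print(f"{embeddings.nbytes} -> {codes.nbytes} bytes, max abs error {err:.4f}")
```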
arXiv Detail & Related papers (2025-10-22T08:03:31Z) - CompactPrompt: A Unified Pipeline for Prompt Data Compression in LLM Workflows [0.9275065651255189]
Large Language Models (LLMs) deliver powerful reasoning and generation capabilities but incur substantial run-time costs. We introduce CompactPrompt, an end-to-end pipeline that merges hard prompt compression with lightweight file-level data compression.
arXiv Detail & Related papers (2025-10-20T19:31:11Z) - OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule [54.37983890753086]
We introduce OjaKV, a framework that integrates a strategic hybrid storage policy with online subspace adaptation. OjaKV preserves the crucial first and most recent tokens at full rank, maintaining high-fidelity anchors for attention. It applies low-rank compression by incrementally adapting the projection basis using Oja's algorithm for online principal component analysis.
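Oja's rule itself is standard; a minimal sketch of the online subspace update on a synthetic stream follows (this is not OjaKV's hybrid cache policy, only the adaptation step it builds on).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, eta = 64, 8, 1e-3   # hidden dim, target rank, learning rate

# Synthetic stream of key/value-like vectors with a dominant 8-dim subspace.
basis = np.linalg.qr(rng.normal(size=(d, r)))[0]
stream = (rng.normal(size=(20_000, r)) @ basis.T
          + 0.05 * rng.normal(size=(20_000, d)))

W = np.linalg.qr(rng.normal(size=(d, r)))[0]  # initial projection basis
for x in stream:
    y = W.T @ x
    W += eta * (np.outer(x, y) - W @ np.outer(y, y))  # Oja's subspace rule

W = np.linalg.qr(W)[0]  # re-orthonormalize
# Alignment with the true subspace: singular values near 1 mean recovered.
print(np.linalg.svd(W.T @ basis, compute_uv=False).round(3))
```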
arXiv Detail & Related papers (2025-09-25T21:42:27Z) - ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z) - Lossless Compression for LLM Tensor Incremental Snapshots [0.0]
We build an effective compression solution, known as LMC. We show that a 16-core parallel implementation of LMC can attain compression and decompression throughput of 2.78 GiB/s and 3.76 GiB/s, respectively. This increase in performance ultimately reduces the resources needed and provides more time to copy the data to the storage system before the next epoch, thus allowing for higher-frequency checkpoints.
arXiv Detail & Related papers (2025-05-14T21:24:14Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.
We propose an efficient and hardware-compatible token compression method called Prune and Merge.
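A generic sketch of the prune-and-merge idea follows, with synthetic scores and embeddings; it is not the paper's exact scoring or merging scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 384))          # ViT-style token embeddings
scores = rng.random(196)                      # stand-in importance scores
k = 98                                        # keep half of the tokens

keep = np.argsort(scores)[-k:]                # prune: retain the top-k tokens
drop = np.argsort(scores)[:-k]

# Merge: instead of discarding pruned tokens outright, fold them into a
# single extra token (score-weighted average) so their information is
# not lost entirely.
w = scores[drop] / scores[drop].sum()
merged = (w[:, None] * tokens[drop]).sum(axis=0, keepdims=True)

compressed = np.concatenate([tokens[keep], merged], axis=0)
print(tokens.shape, "->", compressed.shape)   # (196, 384) -> (99, 384)
```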
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images [60.42768987736088]
We introduce a benchmark that equitably evaluates methodologies across both distillation and pruning literatures.
Our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, even randomly selected subsets can achieve surprisingly competitive performance.
We propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively.
arXiv Detail & Related papers (2025-02-10T13:11:40Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
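The low-rank idea can be sketched with a plain truncated SVD of a toy key-projection matrix; this is a generic illustration, not LoRC's progressive compression strategy.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, rank = 1024, 128, 32

W_k = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)  # toy key projection

# Truncated SVD gives W_k ~ A @ B with A: (d_model, r), B: (r, d_head).
U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
A = U[:, :rank] * S[:rank]
B = Vt[:rank]

# Cache the r-dim intermediate h @ A instead of the full d_head keys;
# the small projection B is applied on the fly at attention time.
h = rng.normal(size=(4096, d_model))          # toy hidden states
cache_low = h @ A                             # (4096, 32) stored
keys = cache_low @ B                          # reconstructed on use

# Random matrices are far from low-rank, so the error here is pessimistic;
# trained weight matrices decay much faster.
rel_err = np.linalg.norm(keys - h @ W_k) / np.linalg.norm(h @ W_k)
print(f"relative reconstruction error: {rel_err:.3f}")
```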
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - End-to-end learned Lossy Dynamic Point Cloud Attribute Compression [5.717288278431968]
This study introduces an end-to-end learned dynamic lossy attribute coding approach.
We employ a context model that leverages the previous latent space in conjunction with an auto-regressive context model to encode the latent tensor into a bitstream.
arXiv Detail & Related papers (2024-08-20T09:06:59Z) - Concise and Precise Context Compression for Tool-Using Language Models [60.606281074373136]
We propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models.
Results on API-Bank and APIBench show that our approach reaches performance comparable to the upper-bound baseline at compression ratios of up to 16x.
arXiv Detail & Related papers (2024-07-02T08:17:00Z) - Sparse $L^1$-Autoencoders for Scientific Data Compression [0.0]
We introduce effective data compression methods by developing autoencoders using high-dimensional latent spaces that are $L^1$-regularized.
We show how these information-rich latent spaces can be used to mitigate blurring and other artifacts to obtain highly effective data compression methods for scientific data.
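A minimal sketch of an $L^1$-regularized autoencoder in PyTorch follows; the toy data, sizes, and penalty weight are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2048, 256)                    # toy "scientific" samples

latent = 512                                  # high-dimensional latent space
enc = nn.Sequential(nn.Linear(256, latent), nn.ReLU())
dec = nn.Linear(latent, 256)
opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)

lam = 1e-3                                    # strength of the L1 penalty
for step in range(500):
    z = enc(x)
    # Reconstruction loss plus an L1 penalty on the code: the penalty drives
    # most latent activations to zero, so only the sparse nonzeros need to
    # be stored -- that sparsity is the compression.
    loss = ((dec(z) - x) ** 2).mean() + lam * z.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

sparsity = (enc(x).abs() < 1e-4).float().mean()
print(f"fraction of near-zero latent activations: {sparsity:.2f}")
```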
arXiv Detail & Related papers (2024-05-23T07:48:00Z) - LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression [43.048684907893104]
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency.
We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one.
Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
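Once a classifier emits per-token keep probabilities, the compression step itself is simple; the sketch below uses hard-coded probabilities in place of a trained model.

```python
# Sketch of the compression step: a trained token classifier would emit a
# keep-probability per token; here the probabilities are made-up stand-ins.
tokens = ["Please", "kindly", "summarize", "the", "attached", "quarterly",
          "report", "in", "three", "short", "bullet", "points", "."]
keep_prob = [0.2, 0.1, 0.9, 0.3, 0.6, 0.9, 0.9, 0.4, 0.9, 0.5, 0.9, 0.9, 0.8]

def compress(tokens, probs, ratio=0.5):
    """Keep the highest-probability tokens, preserving original order."""
    k = max(1, int(len(tokens) * ratio))
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -probs[i])[:k])
    return [tokens[i] for i in keep]

print(" ".join(compress(tokens, keep_prob)))
# -> "summarize quarterly report three bullet points"
```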
arXiv Detail & Related papers (2024-03-19T17:59:56Z) - Data-Aware Gradient Compression for FL in Communication-Constrained Mobile Computing [20.70238092277094]
Federated Learning (FL) in mobile environments faces significant communication bottlenecks.
A one-size-fits-all compression approach does not account for the varying data volumes across workers.
We propose assigning varying compression ratios to workers with distinct data distributions and volumes.
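A sketch of the underlying mechanism, top-k gradient sparsification with per-worker ratios, follows; the policy for choosing the ratios is the paper's contribution and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsify(grad, ratio):
    """Top-k sparsification: keep the largest-magnitude `ratio` fraction."""
    k = max(1, int(grad.size * ratio))
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

# Workers with different data volumes get different compression ratios:
# data-rich workers keep (and send) more of their gradient.
grads = {"phone": rng.normal(size=100_000), "laptop": rng.normal(size=100_000)}
ratios = {"phone": 0.01, "laptop": 0.10}

dense = np.zeros(100_000)
for name, g in grads.items():
    idx, vals = sparsify(g, ratios[name])
    dense[idx] += vals          # server accumulates the sparse updates
print({n: f"{int(100_000 * r):,} values sent" for n, r in ratios.items()})
```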
arXiv Detail & Related papers (2023-11-13T13:24:09Z) - Learning Accurate Performance Predictors for Ultrafast Automated Model Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with significant reduction of the search cost.
arXiv Detail & Related papers (2023-04-13T10:52:49Z) - ZipLM: Inference-Aware Structured Pruning of Language Models [56.52030193434863]
We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves state-of-the-art accuracy-vs-speedup trade-offs while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
arXiv Detail & Related papers (2023-02-07T18:55:28Z) - DoCoFL: Downlink Compression for Cross-Device Federated Learning [12.363097878376644]
$\textsf{DoCoFL}$ is a new framework for downlink compression in the cross-device setting.
It offers significant bi-directional bandwidth reduction while achieving accuracy competitive with that of a baseline without any compression.
arXiv Detail & Related papers (2023-02-01T16:08:54Z) - Scalable Hybrid Learning Techniques for Scientific Data Compression [6.803722400888276]
Scientists require compression techniques that accurately preserve derived quantities of interest (QoIs).
This paper presents a physics-informed compression technique implemented as an end-to-end, scalable, GPU-based pipeline for data compression.
arXiv Detail & Related papers (2022-12-21T03:00:18Z) - Dataset Condensation with Latent Space Knowledge Factorization and Sharing [73.31614936678571]
We introduce a novel approach for solving the dataset condensation problem by exploiting the regularity in a given dataset.
Instead of condensing the dataset directly in the original input space, we assume a generative process of the dataset with a set of learnable codes.
We experimentally show that our method achieves new state-of-the-art records by significant margins on various benchmark datasets.
arXiv Detail & Related papers (2022-08-21T18:14:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.