An Enhanced Text Compression Approach Using Transformer-based Language Models
- URL: http://arxiv.org/abs/2412.15250v1
- Date: Sun, 15 Dec 2024 03:01:17 GMT
- Title: An Enhanced Text Compression Approach Using Transformer-based Language Models
- Authors: Chowdhury Mofizur Rahman, Mahbub E Sobhani, Anika Tasnim Rodela, Swakkhar Shatabda
- Abstract summary: We propose a transformer-based method named RejuvenateForme for text decompression.
Our meticulous pre-processing technique incorporates the Lempel-Ziv-Welch algorithm.
RejuvenateForme achieves BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, showcasing its comprehensive efficacy.
- Score: 1.2937020918620652
- Abstract: Text compression shrinks textual data while keeping crucial information, easing constraints on storage, bandwidth, and computation. The integration of lossless compression techniques with transformer-based text decompression has received negligible attention, despite the increasing volume of English text data in communication. The primary barrier in advancing text compression and restoration is optimizing transformer-based approaches with efficient pre-processing and integrating lossless compression algorithms, which prior attempts left unresolved. Here, we propose a transformer-based method named RejuvenateForme for text decompression, addressing prior issues by harnessing a new pre-processing technique and a lossless compression method. Our pre-processing technique, which incorporates the Lempel-Ziv-Welch algorithm, achieves compression ratios of 12.57, 13.38, and 11.42 on the BookCorpus, EN-DE, and EN-FR corpora, state-of-the-art compared to other deep learning and traditional approaches. Furthermore, RejuvenateForme achieves BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, showcasing its comprehensive efficacy. In comparison, the pre-trained T5-Small outperforms prior state-of-the-art models.
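The abstract reports LZW-based pre-processing without implementation details. For reference, here is a minimal sketch of the standard Lempel-Ziv-Welch codec itself; the function names and byte-level dictionary initialization are illustrative choices, not the paper's actual pipeline.

```python
# Minimal LZW codec sketch (standard textbook LZW, not the
# paper's exact pre-processing pipeline).

def lzw_compress(text: str) -> list[int]:
    # Start with all single characters in the dictionary.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current, codes = "", []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                 # extend the longest match
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code   # learn the new phrase
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes: list[int]) -> str:
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        # The only code that can be unknown is the one being defined now.
        entry = dictionary.get(code, previous + previous[0])
        out.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out)

if __name__ == "__main__":
    sample = "to be or not to be, that is the question"
    codes = lzw_compress(sample)
    assert lzw_decompress(codes) == sample
    print(f"{len(sample)} chars -> {len(codes)} codes")
```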
Related papers
- An Enhancement of Jiang, Z., et al.'s Compression-Based Classification Algorithm Applied to News Article Categorization [0.0]
This study enhances Jiang et al.'s compression-based classification algorithm by addressing its limitations in detecting semantic similarities between text documents.
The proposed improvements focus on unigram extraction and optimized concatenation, eliminating reliance on entire document compression.
Experimental results across datasets of varying sizes and complexities demonstrate an average accuracy improvement of 5.73%, with gains of up to 11% on datasets containing longer documents.
arXiv Detail & Related papers (2025-02-20T10:50:59Z) - AlphaZip: Neural Network-Enhanced Lossless Text Compression [0.0]
- AlphaZip: Neural Network-Enhanced Lossless Text Compression [0.0]
This paper introduces a lossless text compression approach using a Large Language Model (LLM).
The method involves two key steps: first, prediction using a dense neural network architecture, such as a transformer block; second, compressing the predicted ranks with standard compression algorithms like Adaptive Huffman, LZ77, or Gzip.
arXiv Detail & Related papers (2024-09-23T14:21:06Z)
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show that MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
- Approximating Human-Like Few-shot Learning with GPT-based Compression [55.699707962017975]
We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference.
We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity.
arXiv Detail & Related papers (2023-08-14T05:22:33Z)
- Reducing The Amortization Gap of Entropy Bottleneck In End-to-End Image Compression [2.1485350418225244]
End-to-end deep trainable models are about to exceed the performance of the traditional handcrafted compression techniques on videos and images.
We propose a simple yet efficient instance-based parameterization method to reduce this amortization gap at a minor cost.
arXiv Detail & Related papers (2022-09-02T11:43:45Z)
- Implicit Neural Representations for Image Compression [103.78615661013623]
Implicit Neural Representations (INRs) have gained attention as a novel and effective representation for various data types.
We propose the first comprehensive compression pipeline based on INRs, including quantization, quantization-aware retraining, and entropy coding.
We find that our approach to source compression with INRs vastly outperforms similar prior work.
arXiv Detail & Related papers (2021-12-08T13:02:53Z)
- On Effects of Compression with Hyperdimensional Computing in Distributed Randomized Neural Networks [6.25118865553438]
We propose a model for distributed classification based on randomized neural networks and hyperdimensional computing.
In this work, we propose a more flexible approach to compression and compare it to conventional compression algorithms, dimensionality reduction, and quantization techniques.
arXiv Detail & Related papers (2021-06-17T22:02:40Z)
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which jointly performs channel pruning and tensor decomposition to compress CNN models.
We achieve a 52.9% FLOPs reduction by removing 48.4% of parameters on ResNet-50, with only a 0.56% Top-1 accuracy drop on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
- Text Compression-aided Transformer Encoding [77.16960983003271]
We propose explicit and implicit text compression approaches to enhance the Transformer encoding.
In standard Transformer encoding, backbone information, meaning the gist of the input text, is not specifically focused on.
Our evaluation on benchmark datasets shows that the proposed explicit and implicit text compression approaches improve results in comparison to strong baselines.
arXiv Detail & Related papers (2021-02-11T11:28:39Z)