An Enhancement of Jiang, Z., et al.'s Compression-Based Classification Algorithm Applied to News Article Categorization
- URL: http://arxiv.org/abs/2502.14444v1
- Date: Thu, 20 Feb 2025 10:50:59 GMT
- Title: An Enhancement of Jiang, Z., et al.'s Compression-Based Classification Algorithm Applied to News Article Categorization
- Authors: Sean Lester C. Benavides, Cid Antonio F. Masapol, Jonathan C. Morano, Dan Michael A. Cortez
- Abstract summary: This study enhances Jiang et al.'s compression-based classification algorithm by addressing its limitations in detecting semantic similarities between text documents.
The proposed improvements focus on unigram extraction and optimized concatenation, eliminating reliance on entire document compression.
Experimental results across datasets of varying sizes and complexities demonstrate an average accuracy improvement of 5.73%, with gains of up to 11% on datasets containing longer documents.
- Abstract: This study enhances Jiang et al.'s compression-based classification algorithm by addressing its limitations in detecting semantic similarities between text documents. The proposed improvements focus on unigram extraction and optimized concatenation, eliminating reliance on entire document compression. By compressing extracted unigrams, the algorithm mitigates sliding window limitations inherent to gzip, improving compression efficiency and similarity detection. The optimized concatenation strategy replaces direct concatenation with the union of unigrams, reducing redundancy and enhancing the accuracy of Normalized Compression Distance (NCD) calculations. Experimental results across datasets of varying sizes and complexities demonstrate an average accuracy improvement of 5.73%, with gains of up to 11% on datasets containing longer documents. Notably, these improvements are more pronounced in datasets with high-label diversity and complex text structures. The methodology achieves these results while maintaining computational efficiency, making it suitable for resource-constrained environments. This study provides a robust, scalable solution for text classification, emphasizing lightweight preprocessing techniques to achieve efficient compression, which in turn enables more accurate classification.
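For concreteness, here is a minimal Python sketch of the pipeline the abstract describes, where NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) and C() denotes compressed size. The lowercased whitespace tokenizer, the sorted-union serialization, and the k-nearest-neighbor vote are illustrative assumptions; the abstract specifies only unigram extraction, union-based concatenation, and gzip compression.
```python
import gzip

def compressed_size(text: str) -> int:
    """C(x): length in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))

def unigrams(text: str) -> set[str]:
    """Unigram extraction: unique lowercased whitespace tokens (an assumed tokenizer)."""
    return set(text.lower().split())

def ncd(x: str, y: str) -> float:
    """NCD over unigram sets, with the union of unigrams standing in for
    direct concatenation of the two documents."""
    ux, uy = unigrams(x), unigrams(y)
    # Sort before joining so the serialization is deterministic.
    cx = compressed_size(" ".join(sorted(ux)))
    cy = compressed_size(" ".join(sorted(uy)))
    cxy = compressed_size(" ".join(sorted(ux | uy)))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(test_doc: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """k-nearest-neighbor vote over NCD, in the style of Jiang et al.'s gzip classifier."""
    nearest = sorted((ncd(test_doc, doc), label) for doc, label in train)[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)
```
Sorting the union before compression keeps the serialization deterministic, so the distance does not depend on which document comes first; deduplicating through the set union is what removes the redundancy that direct concatenation would otherwise feed to gzip.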
Related papers
- Rethinking Large-scale Dataset Compression: Shifting Focus From Labels to Images [60.42768987736088]
We introduce a benchmark that equitably evaluates methodologies across both distillation and pruning literatures.
Our benchmark reveals that in the mainstream dataset distillation setting for large-scale datasets, even randomly selected subsets can achieve surprisingly competitive performance.
We propose a new framework for dataset compression, termed Prune, Combine, and Augment (PCA), which focuses on leveraging image data exclusively.
arXiv Detail & Related papers (2025-02-10T13:11:40Z)
- Accelerated Methods with Compressed Communications for Distributed Optimization Problems under Data Similarity [55.03958223190181]
We propose the first theoretically grounded accelerated algorithms utilizing unbiased and biased compression under data similarity.
Our results are record-setting and are confirmed by experiments on a range of average losses and datasets.
arXiv Detail & Related papers (2024-12-21T00:40:58Z)
- An Enhanced Text Compression Approach Using Transformer-based Language Models [1.2937020918620652]
We propose a transformer-based method named RejuvenateForme for text decompression.
Our meticulous pre-processing technique incorporates the Lempel-Ziv-Welch algorithm.
RejuvenateForme achieves BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, showcasing its comprehensive efficacy.
arXiv Detail & Related papers (2024-12-15T03:01:17Z)
- Lightweight Correlation-Aware Table Compression [58.50312417249682]
Virtual is a framework that integrates seamlessly with existing open formats.
Experiments on data-gov datasets show that Virtual reduces file sizes by up to 40% compared to Apache Parquet.
arXiv Detail & Related papers (2024-10-17T22:28:07Z)
- A framework for compressing unstructured scientific data via serialization [2.5768995309704104]
We present a general framework for compressing unstructured scientific data with known local connectivity.
A common application is simulation data defined on arbitrary finite element meshes.
The framework employs a greedy topology preserving reordering of original nodes which allows for seamless integration into existing data processing pipelines.
arXiv Detail & Related papers (2024-10-10T15:53:35Z)
- Channel-wise Feature Decorrelation for Enhanced Learned Image Compression [16.638869231028437]
The emerging Learned Compression (LC) replaces the traditional modules with Deep Neural Networks (DNN), which are trained end-to-end for rate-distortion performance.
This paper proposes to improve compression by fully exploiting the existing DNN capacity.
Three strategies are proposed and evaluated, which optimize (1) the transformation network, (2) the context model, and (3) both networks.
arXiv Detail & Related papers (2024-03-16T14:30:25Z)
- Lower Bounds and Accelerated Algorithms in Distributed Stochastic Optimization with Communication Compression [31.107056382542417]
Communication compression is an essential strategy for alleviating communication overhead.
We propose NEOLITHIC, a nearly optimal algorithm for compression under mild conditions.
arXiv Detail & Related papers (2023-05-12T17:02:43Z)
- Learning Accurate Performance Predictors for Ultrafast Automated Model Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with significant reduction of the search cost.
arXiv Detail & Related papers (2023-04-13T10:52:49Z)
- Implicit Neural Representations for Image Compression [103.78615661013623]
Implicit Neural Representations (INRs) have gained attention as a novel and effective representation for various data types.
We propose the first comprehensive compression pipeline based on INRs including quantization, quantization-aware retraining and entropy coding.
We find that our approach to source compression with INRs vastly outperforms similar prior work.
arXiv Detail & Related papers (2021-12-08T13:02:53Z)
- Text Compression-aided Transformer Encoding [77.16960983003271]
We propose explicit and implicit text compression approaches to enhance the Transformer encoding.
Backbone information, meaning the gist of the input text, is not specifically focused on by standard Transformer encoding.
Our evaluation on benchmark datasets shows that the proposed explicit and implicit text compression approaches improve results in comparison to strong baselines.
arXiv Detail & Related papers (2021-02-11T11:28:39Z)
- Optimal Gradient Compression for Distributed and Federated Learning [9.711326718689492]
Communication between computing nodes in distributed learning is typically an unavoidable burden.
Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques.
In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error.
arXiv Detail & Related papers (2020-10-07T07:58:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.