Text Ranking and Classification using Data Compression
- URL: http://arxiv.org/abs/2109.11577v1
- Date: Thu, 23 Sep 2021 18:13:17 GMT
- Title: Text Ranking and Classification using Data Compression
- Authors: Nitya Kasturi, Igor L. Markov
- Abstract summary: We propose a language-agnostic approach to text categorization.
We use the Zstandard compressor and strengthen these ideas in several ways, calling the resulting technique Zest.
We show that Zest complements and can compete with language-specific multidimensional content embeddings in production, but cannot outperform other counting methods on public datasets.
- Score: 1.332560004325655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A well-known but rarely used approach to text categorization uses conditional
entropy estimates computed using data compression tools. Text affinity scores
derived from compressed sizes can be used for classification and ranking tasks,
but their success depends on the compression tools used. We use the Zstandard
compressor and strengthen these ideas in several ways, calling the resulting
language-agnostic technique Zest. In applications, this approach simplifies
configuration, avoiding careful feature extraction and large ML models. Our
ablation studies confirm the value of individual enhancements we introduce. We
show that Zest complements and can compete with language-specific
multidimensional content embeddings in production, but cannot outperform other
counting methods on public datasets.
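The abstract describes the classic compression-based affinity idea: a text is assigned to the class whose training corpus "predicts" it best, estimated by how few extra bytes the text adds when compressed together with that corpus. Below is a minimal illustrative sketch of that baseline using the Python zstandard package; the scoring rule, parameters, and helper names are assumptions for illustration and do not reproduce the paper's specific Zest enhancements.

    # Compression-based text classification: score a text against each class by
    # the extra compressed bytes it adds on top of that class's corpus.
    import zstandard as zstd

    def compressed_size(data: bytes, level: int = 19) -> int:
        return len(zstd.ZstdCompressor(level=level).compress(data))

    def affinity(text: str, class_corpus: str, level: int = 19) -> int:
        # Rough conditional-entropy estimate: C(corpus + text) - C(corpus).
        # Smaller means the class corpus models the text better.
        corpus = class_corpus.encode("utf-8")
        joint = corpus + b"\n" + text.encode("utf-8")
        return compressed_size(joint, level) - compressed_size(corpus, level)

    def classify(text: str, corpora_by_class: dict) -> str:
        return min(corpora_by_class, key=lambda c: affinity(text, corpora_by_class[c]))

    # Toy usage with made-up data; for ranking, sort candidate texts by affinity
    # against a single corpus instead of taking the minimum over classes.
    corpora = {
        "sports": "the team won the match. a late goal sealed the game.",
        "tech": "the model compresses text. a new compiler release shipped.",
    }
    print(classify("they scored in the final minute", corpora))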
Related papers
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding (BPE) algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- Concise and Precise Context Compression for Tool-Using Language Models [60.606281074373136]
We propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models.
Results on API-Bank and APIBench show that our approach reaches performance comparable to the upper-bound baseline at compression ratios of up to 16x.
arXiv Detail & Related papers (2024-07-02T08:17:00Z)
- Ranking LLMs by compression [13.801767671391604]
We use five large language models as priors for compression, then compare their performance on challenging natural language processing tasks.
Experimental results show that compression ratio and model performance are positively correlated, suggesting that compression ratio can serve as a general metric for evaluating large language models.
arXiv Detail & Related papers (2024-06-20T10:23:38Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- Approximating Human-Like Few-shot Learning with GPT-based Compression [55.699707962017975]
We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference.
We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity.
arXiv Detail & Related papers (2023-08-14T05:22:33Z)
- A comprehensive empirical analysis on cross-domain semantic enrichment for detection of depressive language [0.9749560288448115]
We start with a rich word embedding pre-trained from a large general dataset, which is then augmented with embeddings learned from a much smaller and more specific domain dataset through a simple non-linear mapping mechanism.
We show that our augmented word embedding representations achieve a significantly better F1 score than the others, especially when applied to a high-quality dataset.
arXiv Detail & Related papers (2021-06-24T07:15:09Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces capture structural and semantic properties to different degrees.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- Sentence Compression as Deletion with Contextual Embeddings [3.3263205689999444]
We exploit contextual embeddings that enable our model to capture the context of its inputs.
Experimental results on a benchmark Google dataset show that by utilizing contextual embeddings, our model achieves a new state-of-the-art F-score.
arXiv Detail & Related papers (2020-06-05T02:40:46Z)
- Exploiting Class Labels to Boost Performance on Embedding-based Text Classification [16.39344929765961]
Embeddings of different kinds have recently become the de facto standard features for text classification.
We introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which weights high-frequency, category-exclusive words higher when computing word embeddings (a rough illustrative sketch of this kind of weighting appears after this list).
arXiv Detail & Related papers (2020-06-03T08:53:40Z)
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that the resulting method, ANT (Anchor & Transform), is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
- Learning Directly from Grammar Compressed Text [17.91878224879985]
We propose a method to apply neural sequence models to text data compressed with grammar compression algorithms without decompression.
To encode the unique symbols that appear in compression rules, we introduce composer modules to incrementally encode the symbols into vector representations.
arXiv Detail & Related papers (2020-02-28T06:51:40Z)
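As referenced in the TF-CR entry above, a rough sketch of that kind of category-aware weighting is given below. The exact TF-CR formula is defined in the cited paper; here it is assumed, for illustration only, to be a word's frequency within a category multiplied by the fraction of the word's occurrences that fall in that category, and the embedding table is a hypothetical dict of word vectors.

    # Illustrative TF-CR-style weighting: upweight words that are frequent in a
    # category and mostly exclusive to it, then use the weights to average word
    # embeddings into a document vector. The formula is an assumption.
    from collections import Counter
    import numpy as np

    def tf_cr_weights(docs_by_class):
        per_class = {c: Counter(w for d in docs for w in d.split())
                     for c, docs in docs_by_class.items()}
        totals = Counter()
        for counts in per_class.values():
            totals.update(counts)
        weights = {}
        for c, counts in per_class.items():
            n_c = sum(counts.values())
            weights[c] = {w: (counts[w] / n_c) * (counts[w] / totals[w])
                          for w in counts}
        return weights

    def weighted_doc_vector(doc, class_weights, embeddings, dim):
        # embeddings: hypothetical dict mapping word -> np.ndarray of shape (dim,)
        vec, norm = np.zeros(dim), 0.0
        for w in doc.split():
            if w in embeddings and w in class_weights:
                vec += class_weights[w] * embeddings[w]
                norm += class_weights[w]
        return vec / norm if norm > 0 else vec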
This list is automatically generated from the titles and abstracts of the papers on this site.