Text Ranking and Classification using Data Compression
- URL: http://arxiv.org/abs/2109.11577v1
- Date: Thu, 23 Sep 2021 18:13:17 GMT
- Title: Text Ranking and Classification using Data Compression
- Authors: Nitya Kasturi, Igor L. Markov
- Abstract summary: We propose a language-agnostic approach to text categorization.
We use the Zstandard compressor and strengthen these ideas in several ways, calling the resulting technique Zest.
We show that Zest complements and can compete with language-specific multidimensional content embeddings in production, but cannot outperform other counting methods on public datasets.
- Score: 1.332560004325655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A well-known but rarely used approach to text categorization uses conditional
entropy estimates computed using data compression tools. Text affinity scores
derived from compressed sizes can be used for classification and ranking tasks,
but their success depends on the compression tools used. We use the Zstandard
compressor and strengthen these ideas in several ways, calling the resulting
language-agnostic technique Zest. In applications, this approach simplifies
configuration, avoiding careful feature extraction and large ML models. Our
ablation studies confirm the value of individual enhancements we introduce. We
show that Zest complements and can compete with language-specific
multidimensional content embeddings in production, but cannot outperform other
counting methods on public datasets.
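The abstract describes the classic compression-based affinity idea: a text is assigned to the class whose training corpus "predicts" it best, estimated by how few extra bytes the text adds when compressed together with that corpus. Below is a minimal illustrative sketch of that baseline using the Python zstandard package; the scoring rule, parameters, and helper names are assumptions for illustration and do not reproduce the paper's specific Zest enhancements.

    # Compression-based text classification: score a text against each class by
    # the extra compressed bytes it adds on top of that class's corpus.
    import zstandard as zstd

    def compressed_size(data: bytes, level: int = 19) -> int:
        return len(zstd.ZstdCompressor(level=level).compress(data))

    def affinity(text: str, class_corpus: str, level: int = 19) -> int:
        # Rough conditional-entropy estimate: C(corpus + text) - C(corpus).
        # Smaller means the class corpus models the text better.
        corpus = class_corpus.encode("utf-8")
        joint = corpus + b"\n" + text.encode("utf-8")
        return compressed_size(joint, level) - compressed_size(corpus, level)

    def classify(text: str, corpora_by_class: dict) -> str:
        return min(corpora_by_class, key=lambda c: affinity(text, corpora_by_class[c]))

    # Toy usage with made-up data; for ranking, sort candidate texts by affinity
    # against a single corpus instead of taking the minimum over classes.
    corpora = {
        "sports": "the team won the match. a late goal sealed the game.",
        "tech": "the model compresses text. a new compiler release shipped.",
    }
    print(classify("they scored in the final minute", corpora))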
Related papers
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding (BPE) algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- Concise and Precise Context Compression for Tool-Using Language Models [60.606281074373136]
We propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models.
Results on API-Bank and APIBench show that our approach reaches performance comparable to the upper-bound baseline at compression ratios of up to 16x.
arXiv Detail & Related papers (2024-07-02T08:17:00Z)
- Ranking LLMs by compression [13.801767671391604]
We use five large language models as priors for compression, then compare their performance on challenging natural language processing tasks.
Experimental results show that compression ratio and model performance are positively correlated, suggesting that compression ratio can serve as a general metric for evaluating large language models.
arXiv Detail & Related papers (2024-06-20T10:23:38Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- Approximating Human-Like Few-shot Learning with GPT-based Compression [55.699707962017975]
We seek to equip generative pre-trained models with human-like learning capabilities that enable data compression during inference.
We present a novel approach that utilizes the Generative Pre-trained Transformer (GPT) to approximate Kolmogorov complexity.
arXiv Detail & Related papers (2023-08-14T05:22:33Z)
- A comprehensive empirical analysis on cross-domain semantic enrichment for detection of depressive language [0.9749560288448115]
We start with a rich word embedding pre-trained from a large general dataset, which is then augmented with embeddings learned from a much smaller and more specific domain dataset through a simple non-linear mapping mechanism.
We show that our augmented word embedding representations achieve a significantly better F1 score than the others, especially when applied to a high-quality dataset.
arXiv Detail & Related papers (2021-06-24T07:15:09Z)
- A Comparative Study on Structural and Semantic Properties of Sentence Embeddings [77.34726150561087]
We propose a set of experiments using a widely-used large-scale data set for relation extraction.
We show that different embedding spaces capture structural and semantic properties to different degrees.
These results provide useful information for developing embedding-based relation extraction methods.
arXiv Detail & Related papers (2020-09-23T15:45:32Z)
- Sentence Compression as Deletion with Contextual Embeddings [3.3263205689999444]
We exploit contextual embeddings that enable our model to capture the context of its inputs.
Experimental results on a benchmark Google dataset show that by utilizing contextual embeddings, our model achieves a new state-of-the-art F-score.
arXiv Detail & Related papers (2020-06-05T02:40:46Z)
- Exploiting Class Labels to Boost Performance on Embedding-based Text Classification [16.39344929765961]
Embeddings of different kinds have recently become the de facto standard features for text classification.
We introduce a weighting scheme, Term Frequency-Category Ratio (TF-CR), which weights high-frequency, category-exclusive words higher when computing word embeddings (a rough illustrative sketch of this kind of weighting appears after this list).
arXiv Detail & Related papers (2020-06-03T08:53:40Z)
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that the resulting method, ANT (Anchor & Transform), is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
- Learning Directly from Grammar Compressed Text [17.91878224879985]
We propose a method to apply neural sequence models to text data compressed with grammar compression algorithms without decompression.
To encode the unique symbols that appear in compression rules, we introduce composer modules to incrementally encode the symbols into vector representations.
arXiv Detail & Related papers (2020-02-28T06:51:40Z)
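As referenced in the TF-CR entry above, a rough sketch of that kind of category-aware weighting is given below. The exact TF-CR formula is defined in the cited paper; here it is assumed, for illustration only, to be a word's frequency within a category multiplied by the fraction of the word's occurrences that fall in that category, and the embedding table is a hypothetical dict of word vectors.

    # Illustrative TF-CR-style weighting: upweight words that are frequent in a
    # category and mostly exclusive to it, then use the weights to average word
    # embeddings into a document vector. The formula is an assumption.
    from collections import Counter
    import numpy as np

    def tf_cr_weights(docs_by_class):
        per_class = {c: Counter(w for d in docs for w in d.split())
                     for c, docs in docs_by_class.items()}
        totals = Counter()
        for counts in per_class.values():
            totals.update(counts)
        weights = {}
        for c, counts in per_class.items():
            n_c = sum(counts.values())
            weights[c] = {w: (counts[w] / n_c) * (counts[w] / totals[w])
                          for w in counts}
        return weights

    def weighted_doc_vector(doc, class_weights, embeddings, dim):
        # embeddings: hypothetical dict mapping word -> np.ndarray of shape (dim,)
        vec, norm = np.zeros(dim), 0.0
        for w in doc.split():
            if w in embeddings and w in class_weights:
                vec += class_weights[w] * embeddings[w]
                norm += class_weights[w]
        return vec / norm if norm > 0 else vec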
This list is automatically generated from the titles and abstracts of the papers on this site.