Hypernym Mercury: Token Optimization Through Semantic Field Constriction And Reconstruction From Hypernyms. A New Text Compression Method
- URL: http://arxiv.org/abs/2505.08058v2
- Date: Wed, 14 May 2025 20:57:31 GMT
- Title: Hypernym Mercury: Token Optimization Through Semantic Field Constriction And Reconstruction From Hypernyms. A New Text Compression Method
- Authors: Chris Forrester, Octavia Sulea
- Abstract summary: We introduce a novel (patent pending) text representation scheme and a first-of-its-kind word-level semantic compression of paragraphs. We show how our results hold at the paragraph level, across multiple genres and models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compute optimization using token reduction of LLM prompts is an emerging task in the fields of NLP and next generation, agentic AI. In this white paper, we introduce a novel (patent pending) text representation scheme and a first-of-its-kind word-level semantic compression of paragraphs that can lead to over 90% token reduction, while retaining high semantic similarity to the source text. We explain how this novel compression technique can be lossless and how the detail granularity is controllable. We discuss benchmark results over open source data (i.e. Bram Stoker's Dracula available through Project Gutenberg) and show how our results hold at the paragraph level, across multiple genres and models.
Related papers
- Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment [69.67015515485349]
We propose AutoRegEmbed, a contrastive learning method built on embedding conditional probability distributions.
We show that our method significantly outperforms traditional contrastive learning approaches.
arXiv Detail & Related papers (2025-02-17T03:36:25Z)
- Token-Level Graphs for Short Text Classification [1.6819960041696331]
We propose an approach which constructs text graphs entirely based on tokens obtained through pre-trained language models (PLMs).
Our method captures contextual and semantic information, overcomes vocabulary constraints, and allows for context-dependent word meanings.
Experimental results demonstrate how our method consistently achieves higher scores or on-par performance with existing methods.
arXiv Detail & Related papers (2024-12-17T10:19:44Z)
- Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.
We introduce novel methodologies and datasets to overcome these challenges.
We propose MhBART, an encoder-decoder model designed to emulate human writing style.
We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
- Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference [16.830389144259584]
We propose context-aware prompt compression (CPC), a sentence-level prompt compression technique.
The key innovation is a novel context-aware sentence encoder that provides a relevance score for each sentence for a given question.
Our method considerably outperforms prior works on prompt compression on benchmark datasets.
arXiv Detail & Related papers (2024-09-02T13:02:51Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation [72.10931780019297]
Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design.
We propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH).
Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method on both common and bigram paraphrase attacks, but also is better at preserving the quality of generation.
arXiv Detail & Related papers (2023-10-06T03:33:42Z)
- Tram: A Token-level Retrieval-augmented Mechanism for Source Code Summarization [76.57699934689468]
We propose a fine-grained Token-level retrieval-augmented mechanism (Tram) on the decoder side to enhance the performance of neural models.
To overcome the challenge of token-level retrieval in capturing contextual code semantics, we also propose integrating code semantics into individual summary tokens.
arXiv Detail & Related papers (2023-05-18T16:02:04Z)
- Semantic Compression With Large Language Models [1.0874100424278175]
Large language models (LLMs) are revolutionizing information retrieval, question answering, summarization, and code generation tasks.
LLMs are inherently limited by the number of input and output tokens that can be processed at once.
This paper presents three contributions to research on LLMs.
arXiv Detail & Related papers (2023-04-25T01:47:05Z)
- Momentum Decoding: Open-ended Text Generation As Graph Exploration [49.812280360794894]
Open-ended text generation with autoregressive language models (LMs) is one of the core tasks in natural language processing.
We formulate open-ended text generation from a new perspective, i.e., we view it as an exploration process within a directed graph.
We propose a novel decoding method, momentum decoding, which encourages the LM to explore new nodes outside the current graph.
arXiv Detail & Related papers (2022-12-05T11:16:47Z)