Learning Directly from Grammar Compressed Text
- URL: http://arxiv.org/abs/2002.12570v1
- Date: Fri, 28 Feb 2020 06:51:40 GMT
- Title: Learning Directly from Grammar Compressed Text
- Authors: Yoichi Sasaki, Kosuke Akimoto, Takanori Maehara
- Abstract summary: We propose a method to apply neural sequence models to text data compressed with grammar compression algorithms without decompression.
To encode the unique symbols that appear in compression rules, we introduce composer modules that incrementally encode the symbols into vector representations.
- Score: 17.91878224879985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks trained on large amounts of text data have been successfully applied to a variety of tasks. While massive text data is usually compressed using techniques such as grammar compression, almost all previous machine learning methods assume already-decompressed sequence data as their input. In this paper, we propose a method to apply neural sequence models directly to text data compressed with grammar compression algorithms, without decompression. To encode the unique symbols that appear in compression rules, we introduce composer modules that incrementally encode the symbols into vector representations. Through experiments on real datasets, we empirically show that the proposed model achieves both memory and computational efficiency while maintaining moderate performance.
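The core idea lends itself to a compact sketch. Below is a minimal PyTorch illustration, not the authors' exact architecture: a grammar compressor such as Re-Pair yields rules of the form X -> A B, a hypothetical MLP "composer" builds each nonterminal's embedding bottom-up from the embeddings of its right-hand-side symbols, and the compressed symbol sequence is then fed to an LSTM classifier without ever decompressing the text. The module names, dimensions, and toy grammar are illustrative assumptions.

```python
# Sketch only: an assumed MLP composer + LSTM classifier over grammar-compressed input.
import torch
import torch.nn as nn

class MLPComposer(nn.Module):
    """Composes the embedding of a rule X -> A B from the embeddings of A and B."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def forward(self, left: torch.Tensor, right: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([left, right], dim=-1))

class CompressedSequenceClassifier(nn.Module):
    def __init__(self, num_terminals: int, dim: int, num_classes: int):
        super().__init__()
        self.terminal_emb = nn.Embedding(num_terminals, dim)
        self.composer = MLPComposer(dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def symbol_embeddings(self, rules: dict[int, tuple[int, int]], num_terminals: int) -> dict[int, torch.Tensor]:
        # Rules are assumed to be topologically ordered: a nonterminal's
        # right-hand-side symbols are always defined before the nonterminal itself.
        table = {t: self.terminal_emb(torch.tensor(t)) for t in range(num_terminals)}
        for nonterminal, (a, b) in rules.items():
            table[nonterminal] = self.composer(table[a], table[b])
        return table

    def forward(self, compressed: list[int], rules: dict[int, tuple[int, int]], num_terminals: int) -> torch.Tensor:
        # Embed the compressed sequence directly; no decompression of the text.
        table = self.symbol_embeddings(rules, num_terminals)
        seq = torch.stack([table[s] for s in compressed]).unsqueeze(0)  # (1, length, dim)
        _, (h, _) = self.encoder(seq)
        return self.head(h[-1])

# Toy usage: terminals 0..3, rule 4 -> (0, 1), rule 5 -> (4, 2);
# the compressed text is a short sequence over terminals and nonterminals.
model = CompressedSequenceClassifier(num_terminals=4, dim=16, num_classes=2)
logits = model(compressed=[5, 3, 4], rules={4: (0, 1), 5: (4, 2)}, num_terminals=4)
print(logits.shape)  # torch.Size([1, 2])
```

Note that each rule's embedding is computed once and reused wherever the nonterminal occurs, which is where the memory and computational savings of working on the compressed sequence come from.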
Related papers
- AlphaZip: Neural Network-Enhanced Lossless Text Compression [0.0]
This paper introduces a lossless text compression approach using a Large Language Model (LLM).
The method involves two key steps: first, prediction using a dense neural network architecture, such as a transformer block; second, compressing the predicted ranks with standard compression algorithms like Adaptive Huffman, LZ77, or Gzip.
arXiv Detail & Related papers (2024-09-23T14:21:06Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- Perceptual Image Compression with Cooperative Cross-Modal Side Information [53.356714177243745]
We propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff.
Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features.
arXiv Detail & Related papers (2023-11-23T08:31:11Z)
- Crossword: A Semantic Approach to Data Compression via Masking [38.107509264270924]
This study places careful emphasis on English text and exploits its semantic aspect to enhance the compression efficiency further.
The proposed masking-based strategy resembles a crossword game.
In a nutshell, the encoder evaluates the semantic importance of each word according to the semantic loss and then masks the minor ones, while the decoder aims to recover the masked words from the semantic context by means of the Transformer.
arXiv Detail & Related papers (2023-04-03T16:04:06Z)
- Syntax-Aware Network for Handwritten Mathematical Expression Recognition [53.130826547287626]
Handwritten mathematical expression recognition (HMER) is a challenging task that has many potential applications.
Recent methods for HMER have achieved outstanding performance with an encoder-decoder architecture.
We propose a simple and efficient method for HMER, which is the first to incorporate syntax information into an encoder-decoder network.
arXiv Detail & Related papers (2022-03-03T09:57:19Z)
- COIN++: Data Agnostic Neural Compression [55.27113889737545]
COIN++ is a neural compression framework that seamlessly handles a wide range of data modalities.
We demonstrate the effectiveness of our method by compressing various data modalities.
arXiv Detail & Related papers (2022-01-30T20:12:04Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Text Ranking and Classification using Data Compression [1.332560004325655]
We propose a language-agnostic approach to text categorization.
We use the Zstandard compressor and strengthen compression-based categorization in several ways, calling the resulting technique Zest.
We show that Zest complements and can compete with language-specific multidimensional content embeddings in production, but cannot outperform other counting methods on public datasets (a minimal compression-distance sketch in this spirit appears after this list).
arXiv Detail & Related papers (2021-09-23T18:13:17Z)
- Text Compression-aided Transformer Encoding [77.16960983003271]
In standard Transformer encoding, backbone information, meaning the gist of the input text, is not specifically focused on.
We propose explicit and implicit text compression approaches to enhance the Transformer encoding.
Our evaluation on benchmark datasets shows that the proposed explicit and implicit text compression approaches improve results in comparison to strong baselines.
arXiv Detail & Related papers (2021-02-11T11:28:39Z)
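As referenced in the Zest entry above, the following is a minimal sketch of compression-based text classification: a document receives the label whose reference corpus gives the smallest normalized compression distance. It uses zlib from the Python standard library as a stand-in for Zstandard, and the labels and example texts are purely illustrative; this is not the Zest method itself.

```python
# Sketch only: normalized compression distance (NCD) for language-agnostic text categorization.
import zlib

def compressed_size(text: str) -> int:
    # Size of the zlib-compressed UTF-8 bytes at maximum compression level (9).
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(doc: str, reference: str) -> float:
    # Normalized compression distance between a document and a reference corpus.
    c_doc = compressed_size(doc)
    c_ref = compressed_size(reference)
    c_joint = compressed_size(reference + " " + doc)
    return (c_joint - min(c_doc, c_ref)) / max(c_doc, c_ref)

def classify(doc: str, corpora: dict[str, str]) -> str:
    # Assign the label whose reference corpus is closest to the document under NCD.
    return min(corpora, key=lambda label: ncd(doc, corpora[label]))

if __name__ == "__main__":
    # Hypothetical labels and reference texts, for illustration only.
    corpora = {
        "sports": "the team won the match after scoring two goals in the second half",
        "compression": "grammar compression replaces repeated substrings with rules to shrink text",
    }
    print(classify("the team scored a late goal to win the match", corpora))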
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.