Crossword: A Semantic Approach to Data Compression via Masking
- URL: http://arxiv.org/abs/2304.01106v1
- Date: Mon, 3 Apr 2023 16:04:06 GMT
- Title: Crossword: A Semantic Approach to Data Compression via Masking
- Authors: Mingxiao Li, Rui Jin, Liyao Xiang, Kaiming Shen, Shuguang Cui
- Abstract summary: This study places careful emphasis on English text and exploits its semantic aspect to enhance the compression efficiency further.
The proposed masking-based strategy resembles the crossword game.
In a nutshell, the encoder evaluates the semantic importance of each word according to the semantic loss and then masks the minor ones, while the decoder aims to recover the masked words from the semantic context by means of the Transformer.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional methods for data compression are typically based on
symbol-level statistics, with the information source modeled as a long sequence
of i.i.d. random variables or a stochastic process, thus establishing the
fundamental limit as entropy for lossless compression and as mutual information
for lossy compression. However, the source (including text, music, and speech)
in the real world is often statistically ill-defined because of its close
connection to human perception, and thus the model-driven approach can be quite
suboptimal. This study places careful emphasis on English text and exploits its
semantic aspect to enhance the compression efficiency further. The main idea
stems from the crossword puzzle, in which the hidden words can still be
precisely reconstructed so long as some key letters are provided. The proposed
masking-based strategy resembles the above game. In a nutshell, the encoder
evaluates the semantic importance of each word according to the semantic loss
and then masks the minor ones, while the decoder aims to recover the masked
words from the semantic context by means of the Transformer. Our experiments
show that the proposed semantic approach can achieve much higher compression
efficiency than traditional methods such as Huffman coding and UTF-8 encoding,
while preserving the meaning in the target text to a great extent.
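To make the encoder/decoder loop concrete, below is a minimal Python sketch assuming a pretrained BERT masked language model from Hugging Face `transformers`. The stopword heuristic is only a stand-in for the paper's semantic-loss scoring, and the bitstream coding of the surviving words is omitted; this is an illustration, not the authors' implementation.

```python
# Minimal sketch of masking-based semantic compression (illustrative only):
# the stopword set stands in for the paper's semantic-loss scoring, and a
# real codec would still entropy-code the surviving words and mask positions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

COMMON = {"the", "a", "an", "of", "to", "in", "on", "at", "and", "is", "was", "that"}

def encode(text: str) -> str:
    """Mask words judged semantically minor; keep the content-bearing ones."""
    return " ".join(tok.mask_token if w.lower() in COMMON else w
                    for w in text.split())

def decode(masked: str) -> str:
    """Greedily fill every [MASK] with the model's top prediction."""
    ids = tok(masked, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = mlm(ids).logits[0]
    out = ids[0].clone()
    for pos in (out == tok.mask_token_id).nonzero().flatten():
        out[pos] = logits[pos].argmax()
    return tok.decode(out[1:-1])  # drop [CLS] / [SEP]

packed = encode("the cat sat on the mat near the window")
print(packed)          # "[MASK] cat sat [MASK] [MASK] mat near [MASK] window"
print(decode(packed))  # the MLM usually restores the masked function words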
Related papers
- AlphaZip: Neural Network-Enhanced Lossless Text Compression (arXiv, 2024-09-23)
This paper introduces a lossless text compression approach using a Large Language Model (LLM).
The method involves two key steps: first, prediction using a dense neural network architecture, such as a transformer block; second, compressing the predicted ranks with standard compression algorithms like Adaptive Huffman, LZ77, or Gzip.
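A rough sketch of this two-step recipe, assuming GPT-2 from `transformers` as the predictor and zlib as a stand-in for the backend compressor (the paper itself considers Adaptive Huffman, LZ77, and Gzip):

```python
# Sketch of the rank-then-compress idea: replace each token by the rank the
# LM assigns it, then feed the rank stream to a generic compressor. In a
# real lossless codec the first token would also be sent verbatim.
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def text_to_ranks(text: str) -> list[int]:
    """Rank of each actual token under the LM's prediction from its prefix."""
    ids = tok(text, return_tensors="pt").input_ids[0]
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]   # (T, vocab)
    ranks = []
    for pos in range(1, len(ids)):
        order = torch.argsort(logits[pos - 1], descending=True)
        ranks.append(int((order == ids[pos]).nonzero().item()))
    return ranks

def compress(ranks: list[int]) -> bytes:
    # Well-predicted text yields mostly tiny ranks, which zlib squeezes well.
    return zlib.compress(b"".join(r.to_bytes(4, "big") for r in ranks), 9)

blob = compress(text_to_ranks("Data compression exploits statistical redundancy."))
print(len(blob), "bytes")
```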
- SMC++: Masked Learning of Unsupervised Video Semantic Compression (arXiv, 2024-06-07)
We propose a Masked Video Modeling (MVM)-powered compression framework that particularly preserves video semantics.
MVM is proficient at learning generalizable semantics through the masked patch prediction task, but it may also encode non-semantic information such as trivial texture details, wasting bits and introducing semantic noise.
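As a generic illustration of the masked patch prediction objective behind MVM (a toy PyTorch sketch under assumed shapes, not the SMC++ architecture):

```python
# Toy masked video model: hide most patches, reconstruct them from the rest.
import torch
import torch.nn as nn

class TinyMVM(nn.Module):
    def __init__(self, patch_dim: int = 768, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, patch_dim)

    def forward(self, patches: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patches: (B, N, patch_dim) flattened spatio-temporal patches
        # mask:    (B, N) bool, True where a patch is hidden from the model
        x = self.proj(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(self.encoder(x))

patches = torch.randn(2, 16, 768)              # stand-in video patch features
mask = torch.rand(2, 16) < 0.75                # hide 75% of the patches
recon = TinyMVM()(patches, mask)
loss = ((recon - patches) ** 2)[mask].mean()   # loss only on hidden patches
```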
- Perceptual Image Compression with Cooperative Cross-Modal Side Information (arXiv, 2023-11-23)
We propose a novel deep image compression method with text-guided side information to achieve a better rate-perception-distortion tradeoff.
Specifically, we employ the CLIP text encoder and an effective Semantic-Spatial Aware block to fuse the text and image features.
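A hedged sketch of such text-guided fusion, using the CLIP text encoder from `transformers` with plain cross-attention standing in for the paper's Semantic-Spatial Aware block; module names and dimensions here are illustrative:

```python
# Cross-attention fusion of CLIP text features into image latents (a generic
# stand-in for the paper's Semantic-Spatial Aware block; shapes are assumed).
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_enc = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

class TextImageFusion(nn.Module):
    def __init__(self, img_dim: int = 256, txt_dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(img_dim, num_heads=4,
                                          kdim=txt_dim, vdim=txt_dim,
                                          batch_first=True)

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, H*W, img_dim) flattened latents from an image encoder
        # txt_feats: (B, T, txt_dim) token features from the CLIP text encoder
        fused, _ = self.attn(img_feats, txt_feats, txt_feats)
        return img_feats + fused  # residual: text refines the image latent

caption = tokenizer(["a cat on a mat"], return_tensors="pt", padding=True)
with torch.no_grad():
    txt = text_enc(**caption).last_hidden_state   # (1, T, 512)
img_latent = torch.randn(1, 64, 256)              # stand-in encoder output
side_informed = TextImageFusion()(img_latent, txt)
```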
- Semantic Text Compression for Classification (arXiv, 2023-09-19)
We study semantic compression for text where meanings contained in the text are conveyed to a source decoder, e.g., for classification.
We propose semantic quantization and compression approaches for text, utilizing sentence embeddings and a semantic distortion metric to preserve the meaning.
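One way to realize such semantic quantization, sketched under the assumption of `sentence-transformers` embeddings and a k-means codebook, with plain Euclidean distance in place of the paper's semantic distortion metric:

```python
# Semantic quantization sketch: sentences -> embeddings -> codeword indices.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["the movie was wonderful", "terrible plot and acting",
          "average film, nothing special", "an instant classic"]
vecs = embedder.encode(corpus)

# Codebook: K centroids; every sentence compresses to log2(K) bits.
codebook = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vecs)

def quantize(sentence: str) -> int:
    """Compress a sentence to the index of its nearest centroid."""
    return int(codebook.predict(embedder.encode([sentence]))[0])

def dequantize(index: int) -> str:
    """Decode to the corpus sentence closest to the chosen centroid."""
    center = codebook.cluster_centers_[index]
    return corpus[int(np.argmin(np.linalg.norm(vecs - center, axis=1)))]

idx = quantize("what a fantastic film")
print(idx, "->", dequantize(idx))   # should land in the positive cluster
```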
- EntropyRank: Unsupervised Keyphrase Extraction via Side-Information Optimization for Language Model-based Text Compression (arXiv, 2023-08-25)
We propose an unsupervised method to extract keywords and keyphrases from texts based on a pre-trained language model (LM) and Shannon's information.
Specifically, our method extracts phrases having the highest conditional entropy under the LM.
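A small sketch of the underlying scoring step, assuming GPT-2 from `transformers`: the entropy of the LM's next-token distribution is computed at each position, and high-entropy positions are treated as candidate keyphrase onsets (the paper's phrase segmentation is simplified away):

```python
# Per-token conditional entropy under a causal LM (GPT-2 assumed).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_entropies(text: str):
    """Pair each token with the entropy (bits) of the LM's prediction for it."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0]                     # (T, vocab)
    probs = torch.softmax(logits, dim=-1)
    ent = -(probs * torch.log2(probs.clamp_min(1e-12))).sum(-1)
    tokens = tok.convert_ids_to_tokens(ids[0])
    return list(zip(tokens[1:], ent[:-1].tolist()))

scores = token_entropies("Semantic compression keeps the meaning of the text.")
for token, h in sorted(scores, key=lambda p: -p[1])[:3]:
    print(f"{token!r}: {h:.2f} bits")   # the LM's most surprising positions
```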
- Towards Semantic Communications: Deep Learning-Based Image Semantic Coding (arXiv, 2022-08-08)
We consider semantic communications for image data, which is much richer in semantics and more bandwidth sensitive.
We propose a reinforcement learning based adaptive semantic coding (RL-ASC) approach that encodes images beyond the pixel level.
Experimental results demonstrate that the proposed RL-ASC is noise robust and can reconstruct visually pleasing, semantically consistent images.
- Semantic-Preserving Adversarial Text Attacks (arXiv, 2021-08-23)
We propose a Bigram and Unigram based adaptive Semantic Preservation Optimization (BU-SPO) method to examine the vulnerability of deep models.
Compared with existing methods, our method achieves the highest attack success and semantic preservation rates while changing the fewest words.
- Text Compression-aided Transformer Encoding (arXiv, 2021-02-11)
We propose explicit and implicit text compression approaches to enhance the Transformer encoding.
In standard Transformer encoding, backbone information, i.e., the gist of the input text, is not specifically focused on.
Our evaluation on benchmark datasets shows that the proposed explicit and implicit text compression approaches improve results in comparison to strong baselines.
- Learning Directly from Grammar Compressed Text (arXiv, 2020-02-28)
We propose a method to apply neural sequence models to text data compressed with grammar compression algorithms without decompression.
To encode the unique symbols that appear in compression rules, we introduce composer modules to incrementally encode the symbols into vector representations.
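A toy PyTorch sketch of such a composer module, under the assumption of a binary grammar (e.g., Re-Pair-style rules, where each rule rewrites to exactly two symbols); the paper's actual architecture is more elaborate:

```python
# Toy composer: build a vector for each grammar rule from its two children,
# so compressed text can be encoded without expanding the grammar.
import torch
import torch.nn as nn

class Composer(nn.Module):
    def __init__(self, n_terminals: int, dim: int = 64):
        super().__init__()
        self.term = nn.Embedding(n_terminals, dim)
        self.compose = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh())

    def forward(self, rules):
        # rules: list of (left, right); ids >= 0 are terminals, while a
        # negative id -k refers to the k-th previously composed rule.
        table = []
        def vec(sym):
            return (self.term(torch.tensor(sym)) if sym >= 0
                    else table[-sym - 1])
        for left, right in rules:
            table.append(self.compose(torch.cat([vec(left), vec(right)])))
        return torch.stack(table)

# Grammar for "abab": R1 -> a b, R2 -> R1 R1   (a=0, b=1, R1=-1)
rule_vecs = Composer(n_terminals=2)([(0, 1), (-1, -1)])
print(rule_vecs.shape)   # torch.Size([2, 64]) -- one vector per rule
```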
This list is automatically generated from the titles and abstracts of the papers on this site.