byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings
- URL: http://arxiv.org/abs/2106.13302v1
- Date: Thu, 24 Jun 2021 20:14:48 GMT
- Title: byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings
- Authors: Xiang Zhang, Alexandre Drouin, Raymond Li
- Abstract summary: We introduce byteSteady, a fast model for classification using byte-level n-gram embeddings.
A straightforward application of byteSteady is text classification.
We also apply byteSteady to one type of non-language data -- DNA sequences for gene classification.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This article introduces byteSteady -- a fast model for classification using
byte-level n-gram embeddings. byteSteady assumes that each input comes as a
sequence of bytes. A representation vector is produced using the averaged
embedding vectors of byte-level n-grams, with a pre-defined set of n. The
hashing trick is used to reduce the number of embedding vectors. This input
representation vector is then fed into a linear classifier. A straightforward
application of byteSteady is text classification. We also apply byteSteady to
one type of non-language data -- DNA sequences for gene classification. For
both problems we achieved competitive classification results against strong
baselines, suggesting that byteSteady can be applied to both language and
non-language data. Furthermore, we find that simple compression using Huffman
coding does not significantly impact the results, which offers an
accuracy-speed trade-off previously unexplored in machine learning.
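The described pipeline is small enough to sketch directly. Below is a minimal illustration of the four steps (byte n-grams with a pre-defined set of n, the hashing trick, embedding averaging, and a linear classifier); the bucket count, embedding dimension, class count, and n-gram set are illustrative assumptions, not the paper's hyper-parameters.

    import zlib
    import numpy as np

    # Assumed sizes, for illustration only.
    NUM_BUCKETS, EMB_DIM, NUM_CLASSES = 2**20, 16, 4
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(NUM_BUCKETS, EMB_DIM)).astype(np.float32)
    W = rng.normal(size=(NUM_CLASSES, EMB_DIM)).astype(np.float32)
    b = np.zeros(NUM_CLASSES, dtype=np.float32)

    def byte_ngrams(data: bytes, ns=(1, 2, 4)):
        # Enumerate byte-level n-grams for a pre-defined set of n.
        for n in ns:
            for i in range(len(data) - n + 1):
                yield data[i:i + n]

    def represent(data: bytes) -> np.ndarray:
        # Hashing trick: map each n-gram onto a fixed table of embedding rows,
        # then average the looked-up vectors into one representation.
        ids = [zlib.crc32(g) % NUM_BUCKETS for g in byte_ngrams(data)]
        return embeddings[ids].mean(axis=0)

    # Linear classifier over the averaged representation vector.
    logits = represent("ACGTACGT".encode("utf-8")) @ W.T + b
    print(logits.argmax())

Because the model only ever sees bytes, the same code path applies unchanged to text, DNA sequences, or Huffman-compressed input streams.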
Related papers
- Classification Done Right for Vision-Language Pre-Training [66.90286715149786]
We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data.
SuperClass directly utilizes tokenized raw text as supervised classification labels, without the need for additional text filtering or selection.
SuperClass demonstrates superior performance on various downstream tasks, including classic computer vision benchmarks and vision-language downstream tasks.
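Read literally, the summary implies a pipeline like the sketch below, where the token ids of the raw caption are used directly as a multi-hot classification target; the encoder stub, vocabulary size, and loss choice are assumptions for illustration.

    import torch

    VOCAB_SIZE = 32000                                # assumed tokenizer size
    image_features = torch.randn(2, 512)              # stand-in image encoder output
    classifier = torch.nn.Linear(512, VOCAB_SIZE)     # one "class" per text token

    # Tokenized raw captions, used as labels without any filtering or selection.
    caption_token_ids = [[101, 2203, 7592], [101, 999]]
    targets = torch.zeros(2, VOCAB_SIZE)
    for row, ids in enumerate(caption_token_ids):
        targets[row, ids] = 1.0                       # multi-hot label vector

    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        classifier(image_features), targets)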
arXiv Detail & Related papers (2024-11-05T18:58:15Z)
- Ordered and Binary Speaker Embedding [12.22202088781098]
We propose an ordered binary embedding approach that sorts the dimensions of the embedding vector via a nested dropout and converts the sorted vectors to binary codes via Bernoulli sampling.
The resulting ordered binary codes offer important merits such as hierarchical clustering, reduced memory usage, and fast retrieval.
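A rough sketch of the two named ingredients, nested dropout during training and Bernoulli binarization afterwards, with assumed shapes; truncating a resulting code to its first k bits yields a coarser but still usable code, which is what enables hierarchical clustering and fast retrieval.

    import torch

    emb = torch.sigmoid(torch.randn(8, 64))           # speaker embeddings in (0, 1)

    # Nested dropout: keep a random-length prefix of dimensions per example,
    # which pushes the most important information into the earliest dimensions.
    keep = torch.randint(1, emb.shape[1] + 1, (emb.shape[0],))
    mask = torch.arange(emb.shape[1]).unsqueeze(0) < keep.unsqueeze(1)
    trained = emb * mask

    # Bernoulli sampling converts the ordered vectors into binary codes.
    binary_codes = torch.bernoulli(emb)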
arXiv Detail & Related papers (2023-05-25T13:21:00Z)
- A Byte Sequence is Worth an Image: CNN for File Fragment Classification Using Bit Shift and n-Gram Embeddings [21.14735408046021]
File fragment classification (FFC) on small chunks of memory is essential in memory forensics and Internet security.
Existing methods mainly treat file fragments as 1d byte signals and utilize the captured inter-byte features for classification.
We propose Byte2Image, a novel data augmentation technique, to introduce the neglected intra-byte information into file fragments and re-treat them as 2d gray-scale images.
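As a hypothetical illustration of that re-treatment, a 1d byte fragment can be expanded bit by bit into a 2d matrix and viewed as a gray-scale image; the fragment content and sizes below are arbitrary.

    import numpy as np

    fragment = np.frombuffer(bytes(range(256)) * 2, dtype=np.uint8)  # 512 bytes

    # Bit-shift expansion: each byte becomes 8 bit-columns, exposing the
    # intra-byte information that 1d byte signals ignore.
    bits = (fragment[:, None] >> np.arange(7, -1, -1)) & 1           # (512, 8)
    image = (bits * 255).astype(np.uint8)                            # 2d gray-scale view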
arXiv Detail & Related papers (2023-04-14T08:06:52Z)
- Optimizing Bi-Encoder for Named Entity Recognition via Contrastive Learning [80.36076044023581]
We present an efficient bi-encoder framework for named entity recognition (NER).
We frame NER as a metric learning problem that maximizes the similarity between the vector representations of an entity mention and its type.
A major challenge to this bi-encoder formulation for NER lies in separating non-entity spans from entity mentions.
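A minimal sketch of that metric-learning formulation with stub encoders; the temperature is an assumption, and giving non-entity spans their own "O" type row is one common workaround, not necessarily the paper's solution to the separation challenge.

    import torch
    import torch.nn.functional as F

    span_emb = F.normalize(torch.randn(5, 256), dim=-1)   # candidate span vectors
    type_emb = F.normalize(torch.randn(4, 256), dim=-1)   # e.g. PER, ORG, LOC, "O"
    gold = torch.tensor([0, 3, 3, 1, 2])                  # gold type per span

    # Maximize similarity between each mention and its type via cross-entropy
    # over cosine similarities (temperature 0.07 is assumed).
    logits = span_emb @ type_emb.T / 0.07
    loss = F.cross_entropy(logits, gold)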
arXiv Detail & Related papers (2022-08-30T23:19:04Z)
- Local Byte Fusion for Neural Machine Translation [19.16966721276286]
Subword tokenization schemes are the dominant technique used in current NLP models.
Byte-based methods, i.e., tokenization into byte sequences, are an alternative.
Experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional models.
arXiv Detail & Related papers (2022-05-23T17:49:02Z)
- Language-driven Semantic Segmentation [88.21498323896475]
We present LSeg, a novel model for language-driven semantic image segmentation.
We use a text encoder to compute embeddings of descriptive input labels.
The encoder is trained with a contrastive objective to align pixel embeddings to the text embedding of the corresponding semantic class.
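In sketch form, the mechanism pairs per-pixel embeddings with label-text embeddings and classifies each pixel by similarity; both encoders are stubbed with random tensors here.

    import torch

    labels = ["dog", "grass", "sky"]                 # descriptive input labels
    text_emb = torch.randn(len(labels), 512)         # text encoder output (stub)
    pixel_emb = torch.randn(64, 64, 512)             # per-pixel embeddings (stub)

    # After contrastive training these dot products align pixels with classes;
    # the argmax over labels yields the segmentation map.
    logits = torch.einsum("hwc,lc->hwl", pixel_emb, text_emb)
    segmentation = logits.argmax(dim=-1)             # (64, 64) label-index map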
arXiv Detail & Related papers (2022-01-10T18:59:10Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
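A toy sketch of the soft block mixing that such gradient-based tokenization implies: candidate blocks of several sizes are formed over the byte/character sequence, scored, and combined with a softmax so the choice stays differentiable. Block sizes, pooling, and the scorer are illustrative assumptions, not the exact GBST design.

    import torch
    import torch.nn.functional as F

    chars = torch.randn(1, 16, 64)                   # (batch, length, dim)
    scorer = torch.nn.Linear(64, 1)                  # scores each candidate block

    candidates = []
    for block in (1, 2, 4):
        # Mean-pool non-overlapping blocks, then broadcast back to full length.
        pooled = F.avg_pool1d(chars.transpose(1, 2), block, stride=block)
        up = pooled.repeat_interleave(block, dim=-1).transpose(1, 2)
        candidates.append(up)

    cand = torch.stack(candidates, dim=2)            # (1, 16, 3, 64)
    weights = F.softmax(scorer(cand), dim=2)         # soft choice of block size
    latent_subwords = (weights * cand).sum(dim=2)    # (1, 16, 64)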
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Neural Machine Translation without Embeddings [44.129310924201604]
Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and subword induction algorithms.
A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8.
Experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models.
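The byte representation itself is tiny to demonstrate: encoding any Unicode string as UTF-8 fixes the vocabulary at 256 symbols, so one-hot byte vectors can stand in for a learned embedding matrix. A minimal sketch:

    import torch

    text = "héllo"                                   # any Unicode string
    byte_ids = torch.tensor(list(text.encode("utf-8")))
    one_hot = torch.nn.functional.one_hot(byte_ids, num_classes=256).float()
    print(byte_ids.tolist())                         # [104, 195, 169, 108, 108, 111]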
arXiv Detail & Related papers (2020-08-21T09:54:11Z)
- Learning Directly from Grammar Compressed Text [17.91878224879985]
We propose a method to apply neural sequence models to text data compressed with grammar compression algorithms without decompression.
To encode the unique symbols that appear in compression rules, we introduce composer modules to incrementally encode the symbols into vector representations.
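A toy sketch of that composer idea under an assumed pair-replacement grammar (as produced by algorithms such as Re-Pair); the rule table, shapes, and network are illustrative.

    import torch

    base = torch.nn.Embedding(256, 32)                # vectors for raw byte symbols
    composer = torch.nn.Linear(64, 32)                # combines two child vectors

    # Assumed grammar: symbol 256 -> "ab", symbol 257 -> 256 256 (i.e. "abab").
    rules = {256: (ord("a"), ord("b")), 257: (256, 256)}
    memo = {}

    def encode(symbol: int) -> torch.Tensor:
        if symbol < 256:
            return base(torch.tensor([symbol]))[0]
        if symbol not in memo:                        # incremental: reuse sub-results
            left, right = rules[symbol]
            pair = torch.cat([encode(left), encode(right)])
            memo[symbol] = torch.tanh(composer(pair))
        return memo[symbol]

    vec = encode(257)   # vector for the compressed text, without decompression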
arXiv Detail & Related papers (2020-02-28T06:51:40Z)