Leader: Prefixing a Length for Faster Word Vector Serialization
- URL: http://arxiv.org/abs/2009.13699v2
- Date: Fri, 9 Oct 2020 04:49:24 GMT
- Title: Leader: Prefixing a Length for Faster Word Vector Serialization
- Authors: Brian Lester
- Abstract summary: Two file formats are used to distribute pre-trained word embeddings.
The GloVe format is a text-based format that suffers from huge file sizes and slow reads.
The word2vec format is a smaller binary format that mixes a textual representation of words with a binary representation of the vectors themselves.
The Leader format addresses both problems by prefixing each word with its length, enabling faster reads while keeping a binary format's smaller file size.
- Score: 11.112281331309939
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Two competing file formats have become the de facto standards for
distributing pre-trained word embeddings. Both are named after the most popular
pre-trained embeddings that are distributed in that format. The GloVe format is
an entirely text-based format that suffers from huge file sizes and slow reads,
and the word2vec format is a smaller binary format that mixes a textual
representation of words with a binary representation of the vectors themselves.
Both formats have problems that we solve with a new format we call the Leader
format. We include a word length prefix for faster reads while maintaining the
smaller file size a binary format offers. We also created a minimalist library
to facilitate the reading and writing of various word vector formats, as well
as tools for converting pre-trained embeddings to our new Leader format.
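To make the comparison concrete, below is a minimal Python sketch of the two read strategies. A word2vec-style reader must scan byte by byte for the space that terminates each word, while a length-prefixed reader can pull the word and its vector in fixed-size reads. The field layout shown here (little-endian uint32 length prefix, UTF-8 word bytes, then float32 components) is an illustrative assumption, not the precise on-disk specification of the Leader format.

```python
import struct
import numpy as np

def read_word2vec_entry(f, dim):
    # word2vec-style: the word is terminated by a space, so we must
    # scan one byte at a time until the delimiter appears.
    word_bytes = bytearray()
    while True:
        b = f.read(1)
        if not b or b == b" ":
            break
        word_bytes += b
    vec = np.frombuffer(f.read(4 * dim), dtype="<f4")
    return word_bytes.decode("utf-8"), vec

def write_length_prefixed_entry(f, word, vec):
    # Hypothetical length-prefixed layout: uint32 byte length of the
    # UTF-8 word, the word bytes, then the float32 vector components.
    data = word.encode("utf-8")
    f.write(struct.pack("<I", len(data)))
    f.write(data)
    f.write(np.asarray(vec, dtype="<f4").tobytes())

def read_length_prefixed_entry(f, dim):
    # The prefix says exactly how many bytes the word occupies, so the
    # word and its vector come back in two fixed-size reads with no
    # per-byte scanning.
    (n,) = struct.unpack("<I", f.read(4))
    word = f.read(n).decode("utf-8")
    vec = np.frombuffer(f.read(4 * dim), dtype="<f4")
    return word, vec
```

Beyond avoiding the per-byte scan, a length prefix also lets a reader skip over entries without decoding them. Per the abstract, the paper's accompanying library handles reading and writing the various formats and converting pre-trained embeddings into the Leader format.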
Related papers
- Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models [5.330795983408874]
We introduce a novel method called late chunking, which leverages long-context embedding models to first embed all tokens of the long text.
The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks.
arXiv Detail & Related papers (2024-09-07T03:54:46Z)
- InstructCMP: Length Control in Sentence Compression through Instruction-based Large Language Models [27.26285945442178]
InstructCMP is an approach to the sentence compression task that can consider the length constraint through instructions.
We show that applying length priming significantly improves the performance of InstructCMP in both zero-shot and fine-tuning settings.
arXiv Detail & Related papers (2024-06-16T23:00:47Z)
- Training LLMs over Neurally Compressed Text [55.11828645767342]
This paper explores the idea of training large language models (LLMs) over highly compressed text.
We propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length.
We demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks.
arXiv Detail & Related papers (2024-04-04T17:48:28Z)
- Transformers Can Achieve Length Generalization But Not Robustly [76.06308648699357]
We show that the success of length generalization is intricately linked to the data format and the type of position encoding.
We show for the first time that standard Transformers can extrapolate to a sequence length that is 2.5x the input length.
arXiv Detail & Related papers (2024-02-14T18:18:29Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z)
- Transforming Sequence Tagging Into A Seq2Seq Task [10.130389627403433]
We study different formats one could use for casting input text sentences into the input and target of a Seq2Seq model.
We introduce a new format, which we show to not only be simpler but also more effective.
We find that the new format is more robust and almost completely devoid of hallucination.
arXiv Detail & Related papers (2022-03-16T03:48:14Z)
- FormatFuzzer: Effective Fuzzing of Binary File Formats [11.201540907330436]
We present FormatFuzzer, a generator for format-specific fuzzers.
The format-specific fuzzer can be used as a standalone producer or mutator in black-box settings.
arXiv Detail & Related papers (2021-09-23T10:28:35Z)
- byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings [77.6701264226519]
We introduce byteSteady, a fast model for classification using byte-level n-gram embeddings.
A straightforward application of byteSteady is text classification.
We also apply byteSteady to one type of non-language data -- DNA sequences for gene classification.
arXiv Detail & Related papers (2021-06-24T20:14:48Z)
- Sent2Matrix: Folding Character Sequences in Serpentine Manifolds for Two-Dimensional Sentence Representation [54.6266741821988]
We propose to convert texts into 2-D representations and develop the Sent2Matrix method.
Our method allows for the explicit incorporation of both word morphologies and boundaries.
Notably, our method is the first attempt to represent texts in 2-D formats.
arXiv Detail & Related papers (2021-03-15T13:52:47Z)
This list is automatically generated from the titles and abstracts of papers on this site.