Bit Cipher -- A Simple yet Powerful Word Representation System that
Integrates Efficiently with Language Models
- URL: http://arxiv.org/abs/2311.11012v1
- Date: Sat, 18 Nov 2023 08:47:35 GMT
- Title: Bit Cipher -- A Simple yet Powerful Word Representation System that
Integrates Efficiently with Language Models
- Authors: Haoran Zhao and Jake Ryland Williams
- Abstract summary: Bit-cipher is a word representation system that eliminates the need for backpropagation while leveraging contextual information and hyper-efficient dimensionality reduction techniques based on unigram frequency.
We perform probing experiments on part-of-speech (POS) tagging and named entity recognition (NER) to assess bit-cipher's competitiveness with classic embeddings.
By replacing embedding layers with cipher embeddings, our experiments illustrate the notable efficiency of cipher in accelerating the training process and attaining better optima.
- Score: 4.807347156077897
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Models (LLMs) become ever more dominant, classic
pre-trained word embeddings sustain their relevance through computational
efficiency and nuanced linguistic interpretation. Drawing from recent studies
demonstrating that GloVe and word2vec optimizations all converge towards
log-co-occurrence matrix variants, we construct a novel word representation
system called Bit-cipher that eliminates the need for backpropagation while
leveraging contextual information and hyper-efficient dimensionality reduction
techniques based on unigram frequency, providing strong interpretability
alongside efficiency. We use the bit-cipher algorithm
to train word vectors via a two-step process that critically relies on a
hyperparameter -- bits -- that controls the vector dimension. While the first
step trains the bit-cipher, the second utilizes it under two different
aggregation modes -- summation or concatenation -- to produce contextually rich
representations from word co-occurrences. We extend our investigation into
bit-cipher's efficacy, performing probing experiments on part-of-speech (POS)
tagging and named entity recognition (NER) to assess its competitiveness with
classic embeddings like word2vec and GloVe. Additionally, we explore its
applicability in LM training and fine-tuning. By replacing embedding layers
with cipher embeddings, our experiments illustrate the notable efficiency of
cipher in accelerating the training process and attaining better optima
compared to conventional training paradigms. Experiments on the integration of
bit-cipher embedding layers with RoBERTa, T5, and OPT, prior to or as a
substitute for fine-tuning, showcase a promising enhancement to transfer
learning, allowing rapid model convergence while preserving competitive
performance.
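To make the two-step procedure concrete, the sketch below shows one plausible Python reading of it: step one assigns each vocabulary word a bits-dimensional cipher code derived from its unigram-frequency rank, and step two builds a word's representation by summing or concatenating the codes of its co-occurring context words. The function names, the binary rank encoding, the context window, and the toy corpus are illustrative assumptions, not the authors' released implementation.

```python
# A minimal, hypothetical sketch of a bit-cipher-style pipeline (illustrative only,
# not the paper's code). Step 1 assigns each word a `bits`-dimensional code from its
# unigram-frequency rank; Step 2 aggregates the codes of co-occurring context words
# by summation or concatenation.
from collections import Counter
import numpy as np

def train_cipher(corpus, bits=8):
    """Step 1: map every word to a `bits`-dim code derived from its frequency rank."""
    freq = Counter(w for sent in corpus for w in sent)
    codes = {}
    for rank, (word, _) in enumerate(freq.most_common()):
        # Illustrative choice: binary encoding of the frequency rank.
        codes[word] = np.array([(rank >> b) & 1 for b in range(bits)], dtype=np.float32)
    return codes

def embed(corpus, codes, bits=8, window=2, mode="sum"):
    """Step 2: build word vectors from the cipher codes of co-occurring words."""
    dim = bits if mode == "sum" else bits * 2 * window
    vecs = {w: np.zeros(dim, dtype=np.float32) for w in codes}
    for sent in corpus:
        for i, w in enumerate(sent):
            ctx = sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]
            if mode == "sum":                      # order-insensitive aggregation
                for c in ctx:
                    vecs[w] += codes[c]
            else:                                  # position-wise concatenation
                for j, c in enumerate(ctx):
                    vecs[w][j * bits:(j + 1) * bits] += codes[c]
    return vecs

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
codes = train_cipher(corpus, bits=8)
vectors = embed(corpus, codes, bits=8, mode="sum")
print(vectors["cat"].shape)   # (8,); mode="concat" would give (32,) with window=2
```

Vectors built this way could, in principle, be loaded into a frozen embedding matrix (for example via torch.nn.Embedding.from_pretrained) to approximate the embedding-layer replacement described above, though the exact integration with RoBERTa, T5, and OPT is not specified in the abstract.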
Related papers
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a postprocessing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and is even prone to incurring heavy degradation.
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Sequence Shortening for Context-Aware Machine Translation [5.803309695504831]
We show that a special case of multi-encoder architecture achieves higher accuracy on contrastive datasets.
We introduce two novel methods, Latent Grouping and Latent Selecting, where the network learns to group tokens or select the tokens to be cached as context.
arXiv Detail & Related papers (2024-02-02T13:55:37Z)
- Sentiment analysis in Tourism: Fine-tuning BERT or sentence embeddings concatenation? [0.0]
We conduct a comparative study between fine-tuning Bidirectional Encoder Representations from Transformers (BERT) and concatenating two embeddings to boost the performance of a stacked Bidirectional Long Short-Term Memory-Bidirectional Gated Recurrent Units model.
A search for the best learning rate was carried out for both approaches, and the best embeddings were compared for each sentence embedding combination.
arXiv Detail & Related papers (2023-12-12T23:23:23Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Word Sense Induction with Hierarchical Clustering and Mutual Information Maximization [14.997937028599255]
Word sense induction is a difficult problem in natural language processing.
We propose a novel unsupervised method based on hierarchical clustering and invariant information clustering.
We empirically demonstrate that, in certain cases, our approach outperforms prior WSI state-of-the-art methods.
arXiv Detail & Related papers (2022-10-11T13:04:06Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have been long devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder [64.55176104620848]
We propose a novel NAR E2E-ST framework, Orthros, in which both NAR and autoregressive (AR) decoders are jointly trained on the shared speech encoder.
The latter is used for selecting better translation among various length candidates generated from the former, which dramatically improves the effectiveness of a large length beam with negligible overhead.
Experiments on four benchmarks show the effectiveness of the proposed method in improving inference speed while maintaining competitive translation quality.
arXiv Detail & Related papers (2020-10-25T06:35:30Z)
- Keyphrase Extraction with Dynamic Graph Convolutional Networks and Diversified Inference [50.768682650658384]
Keyphrase extraction (KE) aims to summarize a set of phrases that accurately express a concept or a topic covered in a given document.
The recent Sequence-to-Sequence (Seq2Seq) based generative framework is widely used for the KE task, and it has obtained competitive performance on various benchmarks.
In this paper, we propose to adopt the Dynamic Graph Convolutional Networks (DGCN) to solve the above two problems simultaneously.
arXiv Detail & Related papers (2020-10-24T08:11:23Z)
- Computationally Efficient NER Taggers with Combined Embeddings and Constrained Decoding [10.643105866460978]
Current state-of-the-art models in Named Entity Recognition (NER) are neural models with a Conditional Random Field (CRF) as the final network layer and pre-trained "contextual embeddings".
In this work, we explore two simple techniques that substantially improve NER performance over a strong baseline with negligible cost.
While training a tagger on CoNLL 2003, we find a 786% speed-up over a contextual embeddings-based tagger without sacrificing strong performance.
arXiv Detail & Related papers (2020-01-05T04:50:38Z)