All Word Embeddings from One Embedding
- URL: http://arxiv.org/abs/2004.12073v3
- Date: Fri, 23 Oct 2020 03:12:12 GMT
- Title: All Word Embeddings from One Embedding
- Authors: Sho Takase and Sosuke Kobayashi
- Abstract summary: In neural network-based models for natural language processing, the largest part of the parameters often consists of word embeddings.
In this study, to reduce the total number of parameters, the embeddings for all words are represented by transforming a shared embedding.
The proposed method, ALONE, constructs the embedding of a word by modifying the shared embedding with a filter vector, which is word-specific but non-trainable.
- Score: 23.643059189673473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In neural network-based models for natural language processing (NLP), the
largest part of the parameters often consists of word embeddings. Conventional
models prepare a large embedding matrix whose size depends on the vocabulary
size. Therefore, storing these models in memory and disk storage is costly. In
this study, to reduce the total number of parameters, the embeddings for all
words are represented by transforming a shared embedding. The proposed method,
ALONE (all word embeddings from one), constructs the embedding of a word by
modifying the shared embedding with a filter vector, which is word-specific but
non-trainable. Then, we input the constructed embedding into a feed-forward
neural network to increase its expressiveness. Naively, the filter vectors
occupy the same memory size as the conventional embedding matrix, which depends
on the vocabulary size. To solve this issue, we also introduce a
memory-efficient filter construction approach. We show that ALONE serves sufficiently well as a word representation through an experiment on the reconstruction of pre-trained word embeddings. In addition, we conduct
experiments on NLP application tasks: machine translation and summarization. We
combined ALONE with the current state-of-the-art encoder-decoder model, the
Transformer, and achieved comparable scores on WMT 2014 English-to-German
translation and DUC 2004 very short summarization with fewer parameters.
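Below is a minimal NumPy sketch of the idea described in the abstract: a single shared embedding is modified element-wise by a word-specific, non-trainable filter vector and then passed through a small feed-forward network. The codebook-based filter construction, the hash used to pick codebook rows, and all sizes are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of the ALONE idea (shared embedding * word-specific filter -> FFN).
# The codebook-based filter construction and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

D = 256               # embedding dimension
NUM_CODEBOOKS = 8     # filter is assembled from a few small fixed codebooks
CODEBOOK_SIZE = 64    # memory: NUM_CODEBOOKS * CODEBOOK_SIZE * D, independent of |V|

shared_embedding = rng.normal(size=D)                            # trainable in practice
codebooks = rng.normal(size=(NUM_CODEBOOKS, CODEBOOK_SIZE, D))   # fixed, non-trainable
W1 = 0.02 * rng.normal(size=(D, 4 * D))                          # trainable FFN weights
W2 = 0.02 * rng.normal(size=(4 * D, D))


def filter_vector(word: str) -> np.ndarray:
    """Word-specific but non-trainable filter, derived deterministically from the word."""
    h = int.from_bytes(word.encode("utf-8"), "little")  # stable stand-in for a real hash
    indices = [(h >> (7 * m)) % CODEBOOK_SIZE for m in range(NUM_CODEBOOKS)]
    return sum(codebooks[m, i] for m, i in enumerate(indices))


def alone_embedding(word: str) -> np.ndarray:
    """Embedding = FFN(shared embedding modified element-wise by the word's filter)."""
    x = shared_embedding * filter_vector(word)
    return np.maximum(x @ W1, 0.0) @ W2   # one hidden layer with ReLU


print(alone_embedding("translation").shape)  # (256,)
```

Because the codebooks and the filter-selection rule are fixed, only the shared embedding and the FFN weights would be trained, which is what keeps the parameter count independent of the vocabulary size.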
Related papers
- Word Embedding Dimension Reduction via Weakly-Supervised Feature Selection [34.217661429283666]
As the vocabulary grows, the vector space's dimension increases, which can lead to a vast model size.
This paper explores word embedding dimension reduction.
We propose an efficient and effective weakly-supervised feature selection method named WordFS.
arXiv Detail & Related papers (2024-07-17T06:36:09Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, performance on the new words improves as they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining [49.213120730582354]
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining.
We propose a novel framework, One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively.
arXiv Detail & Related papers (2023-11-15T10:40:45Z)
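The OFA entry above describes initializing embeddings of unseen subwords, but the summary does not spell out the procedure; the similarity-weighted averaging below is only a hedged sketch of that general idea, and the similarity matrix (e.g. from an auxiliary multilingual word-vector space) is assumed to be given.

```python
# Hedged sketch: initialize unseen-subword embeddings as similarity-weighted
# averages of existing source embeddings. The weighting scheme and the origin
# of the similarity scores are assumptions, not OFA's exact procedure.
import numpy as np


def init_unseen_embeddings(source_emb: np.ndarray, sim: np.ndarray) -> np.ndarray:
    """source_emb: (V_src, D) pretrained subword embeddings.
    sim: (V_new, V_src) similarity between each new subword and each source subword.
    Returns (V_new, D) initial embeddings."""
    logits = sim - sim.max(axis=1, keepdims=True)          # stabilized softmax weights
    weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return weights @ source_emb


rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 64))   # existing PLM subword embeddings
sim = rng.normal(size=(20, 1000))   # similarities for 20 unseen subwords (assumed given)
print(init_unseen_embeddings(src, sim).shape)  # (20, 64)
```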
- Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning [42.652021176354644]
The memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings.
We propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix.
We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks.
arXiv Detail & Related papers (2023-09-15T19:00:00Z)
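A hedged sketch of the kind of embedding pruning the entry above describes: the summary does not state the exact criterion, so keeping only the rows for token ids observed in the target corpus is an assumption.

```python
# Hedged sketch: shrink the embedding matrix to the token ids actually observed
# in a downstream corpus. The selection criterion is an assumption; the summary
# above only says the embedding matrix is pruned to save memory.
import numpy as np


def prune_embedding(embedding: np.ndarray, observed_token_ids):
    """embedding: (V, D) matrix; observed_token_ids: ids seen in the target corpus.
    Returns the pruned matrix and an old-id -> new-id mapping."""
    kept = np.array(sorted(set(observed_token_ids)))
    remap = {int(old): new for new, old in enumerate(kept)}
    return embedding[kept], remap


rng = np.random.default_rng(0)
E = rng.normal(size=(50_000, 128))
pruned, remap = prune_embedding(E, observed_token_ids=[3, 17, 3, 42, 17])
print(pruned.shape, remap[42])  # (3, 128) 2
```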
- Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method, Avg-K, which selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining.
arXiv Detail & Related papers (2023-05-23T12:28:37Z)
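The Avg-K selection mentioned in the entry above can be read as scoring each memory block by an average and keeping the top-k blocks; the concrete scoring below (dot product between the input hidden state and the mean of each block's key vectors) is an assumption based on that one-line description.

```python
# Hedged sketch of an Avg-K style block selection: score each memory block by
# the dot product of the hidden state with the mean of the block's key vectors,
# then activate only the top-k blocks. Details beyond the summary are assumptions.
import numpy as np


def avg_k_select(h: np.ndarray, keys: np.ndarray, block_size: int, k: int) -> np.ndarray:
    """h: (D,) hidden state; keys: (N, D) FFN key vectors in contiguous blocks.
    Returns the indices of the k highest-scoring blocks."""
    num_blocks = keys.shape[0] // block_size
    block_means = keys[: num_blocks * block_size].reshape(num_blocks, block_size, -1).mean(axis=1)
    scores = block_means @ h
    return np.argsort(scores)[-k:][::-1]


rng = np.random.default_rng(0)
hidden = rng.normal(size=64)
ffn_keys = rng.normal(size=(1024, 64))
print(avg_k_select(hidden, ffn_keys, block_size=128, k=2))  # two selected block indices
```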
- Multi hash embeddings in spaCy [1.6790532021482656]
spaCy is a natural language processing library that generates multi-embedding representations of words.
The default embedding layer in spaCy is a hash embeddings layer.
In this technical report we lay out a bit of history and introduce the embedding methods in spaCy in detail.
arXiv Detail & Related papers (2022-12-19T06:03:04Z)
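A hedged sketch of a hash embedding layer in the spirit of spaCy's default, as described in the entry above: each word is hashed with several seeds into one small table and the selected rows are summed, so memory does not grow with the vocabulary. The table size, number of seeds, and hashing scheme are illustrative assumptions, not spaCy's actual implementation.

```python
# Hedged sketch of a hash embedding layer: several independent hashes of a word
# select rows of one small table, and the rows are summed. Sizes and the hash
# construction are illustrative, not spaCy's actual code.
import hashlib

import numpy as np

TABLE_ROWS, DIM, NUM_SEEDS = 5000, 96, 4
rng = np.random.default_rng(0)
table = rng.normal(size=(TABLE_ROWS, DIM))   # trainable in a real model


def _bucket(word: str, seed: int) -> int:
    digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).hexdigest()
    return int(digest, 16) % TABLE_ROWS


def hash_embedding(word: str) -> np.ndarray:
    """Sum the table rows chosen by NUM_SEEDS independent hashes of the word."""
    return sum(table[_bucket(word, s)] for s in range(NUM_SEEDS))


print(hash_embedding("embedding").shape)  # (96,)
```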
- HashFormers: Towards Vocabulary-independent Pre-trained Transformers [30.699644290131044]
Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding.
We propose HashFormers, a new family of vocabulary-independent pre-trained transformers.
arXiv Detail & Related papers (2022-10-14T15:39:34Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs with a strong auto-regressive decoder tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
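A hedged sketch of the anchor-plus-sparse-transform idea summarized in the Anchor & Transform entry above: each word's embedding is a sparse combination of a small set of anchor embeddings, so the dense vocabulary-by-dimension matrix is never materialized. The number of anchors, the sparsity level, and the storage format are illustrative assumptions.

```python
# Hedged sketch of anchor embeddings with a sparse per-word transform: a word's
# embedding is a sparse weighted sum of a few anchors. Sizes and sparsity are
# illustrative assumptions based only on the summary above.
import numpy as np

V, K, D, NNZ = 10_000, 500, 128, 5   # vocab, anchors, embedding dim, nonzeros per word
rng = np.random.default_rng(0)
anchors = rng.normal(size=(K, D))    # small anchor embedding set (learned in practice)

# Store only (anchor_index, weight) pairs per word instead of a dense (V, D) matrix.
sparse_rows = [
    (rng.choice(K, size=NNZ, replace=False), rng.normal(size=NNZ)) for _ in range(V)
]


def embed(word_id: int) -> np.ndarray:
    """Reconstruct one word's embedding from its sparse row of the transform."""
    idx, weights = sparse_rows[word_id]
    return weights @ anchors[idx]


print(embed(42).shape)  # (128,)
```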
This list is automatically generated from the titles and abstracts of the papers on this site.