All Word Embeddings from One Embedding
- URL: http://arxiv.org/abs/2004.12073v3
- Date: Fri, 23 Oct 2020 03:12:12 GMT
- Title: All Word Embeddings from One Embedding
- Authors: Sho Takase and Sosuke Kobayashi
- Abstract summary: In neural network-based models for natural language processing, the largest part of the parameters often consists of word embeddings.
In this study, to reduce the total number of parameters, the embeddings for all words are represented by transforming a shared embedding.
The proposed method, ALONE, constructs the embedding of a word by modifying the shared embedding with a filter vector, which is word-specific but non-trainable.
- Score: 23.643059189673473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In neural network-based models for natural language processing (NLP), the
largest part of the parameters often consists of word embeddings. Conventional
models prepare a large embedding matrix whose size depends on the vocabulary
size. Therefore, storing these models in memory and disk storage is costly. In
this study, to reduce the total number of parameters, the embeddings for all
words are represented by transforming a shared embedding. The proposed method,
ALONE (all word embeddings from one), constructs the embedding of a word by
modifying the shared embedding with a filter vector, which is word-specific but
non-trainable. Then, we input the constructed embedding into a feed-forward
neural network to increase its expressiveness. Naively, the filter vectors
occupy the same memory size as the conventional embedding matrix, which depends
on the vocabulary size. To solve this issue, we also introduce a
memory-efficient filter construction approach. We show that ALONE serves sufficiently well as a word representation through an experiment on the reconstruction of pre-trained word embeddings. In addition, we conduct
experiments on NLP application tasks: machine translation and summarization. We
combined ALONE with the current state-of-the-art encoder-decoder model, the
Transformer, and achieved comparable scores on WMT 2014 English-to-German
translation and DUC 2004 very short summarization with fewer parameters.
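Below is a minimal NumPy sketch of the idea described in the abstract: a single shared embedding is modified element-wise by a word-specific, non-trainable filter vector and then passed through a small feed-forward network. The codebook-based filter construction, the hash used to pick codebook rows, and all sizes are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of the ALONE idea (shared embedding * word-specific filter -> FFN).
# The codebook-based filter construction and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

D = 256               # embedding dimension
NUM_CODEBOOKS = 8     # filter is assembled from a few small fixed codebooks
CODEBOOK_SIZE = 64    # memory: NUM_CODEBOOKS * CODEBOOK_SIZE * D, independent of |V|

shared_embedding = rng.normal(size=D)                            # trainable in practice
codebooks = rng.normal(size=(NUM_CODEBOOKS, CODEBOOK_SIZE, D))   # fixed, non-trainable
W1 = 0.02 * rng.normal(size=(D, 4 * D))                          # trainable FFN weights
W2 = 0.02 * rng.normal(size=(4 * D, D))


def filter_vector(word: str) -> np.ndarray:
    """Word-specific but non-trainable filter, derived deterministically from the word."""
    h = int.from_bytes(word.encode("utf-8"), "little")  # stable stand-in for a real hash
    indices = [(h >> (7 * m)) % CODEBOOK_SIZE for m in range(NUM_CODEBOOKS)]
    return sum(codebooks[m, i] for m, i in enumerate(indices))


def alone_embedding(word: str) -> np.ndarray:
    """Embedding = FFN(shared embedding modified element-wise by the word's filter)."""
    x = shared_embedding * filter_vector(word)
    return np.maximum(x @ W1, 0.0) @ W2   # one hidden layer with ReLU


print(alone_embedding("translation").shape)  # (256,)
```

Because the codebooks and the filter-selection rule are fixed, only the shared embedding and the FFN weights would be trained, which is what keeps the parameter count independent of the vocabulary size.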
Related papers
- Word Embedding Dimension Reduction via Weakly-Supervised Feature Selection [34.217661429283666]
As the vocabulary grows, the vector space's dimension increases, which can lead to a vast model size.
This paper explores word embedding dimension reduction.
We propose an efficient and effective weakly-supervised feature selection method named WordFS.
arXiv Detail & Related papers (2024-07-17T06:36:09Z)
- Continuously Learning New Words in Automatic Speech Recognition [56.972851337263755]
We propose a self-supervised continual learning approach to recognize new words.
We use a memory-enhanced Automatic Speech Recognition model from previous work.
We show that with this approach, performance on the new words improves as they occur more frequently.
arXiv Detail & Related papers (2024-01-09T10:39:17Z)
- OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining [49.213120730582354]
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining.
We propose a novel framework, One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively.
arXiv Detail & Related papers (2023-11-15T10:40:45Z)
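The OFA entry above describes initializing embeddings of unseen subwords, but the summary does not spell out the procedure; the similarity-weighted averaging below is only a hedged sketch of that general idea, and the similarity matrix (e.g. from an auxiliary multilingual word-vector space) is assumed to be given.

```python
# Hedged sketch: initialize unseen-subword embeddings as similarity-weighted
# averages of existing source embeddings. The weighting scheme and the origin
# of the similarity scores are assumptions, not OFA's exact procedure.
import numpy as np


def init_unseen_embeddings(source_emb: np.ndarray, sim: np.ndarray) -> np.ndarray:
    """source_emb: (V_src, D) pretrained subword embeddings.
    sim: (V_new, V_src) similarity between each new subword and each source subword.
    Returns (V_new, D) initial embeddings."""
    logits = sim - sim.max(axis=1, keepdims=True)          # stabilized softmax weights
    weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return weights @ source_emb


rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 64))   # existing PLM subword embeddings
sim = rng.normal(size=(20, 1000))   # similarities for 20 unseen subwords (assumed given)
print(init_unseen_embeddings(src, sim).shape)  # (20, 64)
```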
- Frustratingly Simple Memory Efficiency for Pre-trained Language Models via Dynamic Embedding Pruning [42.652021176354644]
The memory footprint of pre-trained language models (PLMs) can hinder deployment in memory-constrained settings.
We propose a simple yet effective approach that leverages this finding to minimize the memory footprint of the embedding matrix.
We show that this approach provides substantial reductions in memory usage across a wide range of models and tasks.
arXiv Detail & Related papers (2023-09-15T19:00:00Z)
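A hedged sketch of the kind of embedding pruning the entry above describes: the summary does not state the exact criterion, so keeping only the rows for token ids observed in the target corpus is an assumption.

```python
# Hedged sketch: shrink the embedding matrix to the token ids actually observed
# in a downstream corpus. The selection criterion is an assumption; the summary
# above only says the embedding matrix is pruned to save memory.
import numpy as np


def prune_embedding(embedding: np.ndarray, observed_token_ids):
    """embedding: (V, D) matrix; observed_token_ids: ids seen in the target corpus.
    Returns the pruned matrix and an old-id -> new-id mapping."""
    kept = np.array(sorted(set(observed_token_ids)))
    remap = {int(old): new for new, old in enumerate(kept)}
    return embedding[kept], remap


rng = np.random.default_rng(0)
E = rng.normal(size=(50_000, 128))
pruned, remap = prune_embedding(E, observed_token_ids=[3, 17, 3, 42, 17])
print(pruned.shape, remap[42])  # (3, 128) 2
```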
- Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model [58.9100867327305]
Large and sparse feed-forward layers (S-FFN) have proven effective in scaling up Transformer model size for pretraining large language models.
We analyzed two major design choices of S-FFN: the memory block (a.k.a. expert) size and the memory block selection method.
We found a simpler selection method, Avg-K, which selects blocks through their mean aggregated hidden states, achieving lower perplexity in language model pretraining.
arXiv Detail & Related papers (2023-05-23T12:28:37Z)
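The Avg-K selection mentioned in the entry above can be read as scoring each memory block by an average and keeping the top-k blocks; the concrete scoring below (dot product between the input hidden state and the mean of each block's key vectors) is an assumption based on that one-line description.

```python
# Hedged sketch of an Avg-K style block selection: score each memory block by
# the dot product of the hidden state with the mean of the block's key vectors,
# then activate only the top-k blocks. Details beyond the summary are assumptions.
import numpy as np


def avg_k_select(h: np.ndarray, keys: np.ndarray, block_size: int, k: int) -> np.ndarray:
    """h: (D,) hidden state; keys: (N, D) FFN key vectors in contiguous blocks.
    Returns the indices of the k highest-scoring blocks."""
    num_blocks = keys.shape[0] // block_size
    block_means = keys[: num_blocks * block_size].reshape(num_blocks, block_size, -1).mean(axis=1)
    scores = block_means @ h
    return np.argsort(scores)[-k:][::-1]


rng = np.random.default_rng(0)
hidden = rng.normal(size=64)
ffn_keys = rng.normal(size=(1024, 64))
print(avg_k_select(hidden, ffn_keys, block_size=128, k=2))  # two selected block indices
```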
- Multi hash embeddings in spaCy [1.6790532021482656]
spaCy is a natural language processing library that generates multi-embedding representations of words.
The default embedding layer in spaCy is a hash embeddings layer.
In this technical report we lay out a bit of history and introduce the embedding methods in spaCy in detail.
arXiv Detail & Related papers (2022-12-19T06:03:04Z)
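A hedged sketch of a hash embedding layer in the spirit of spaCy's default, as described in the entry above: each word is hashed with several seeds into one small table and the selected rows are summed, so memory does not grow with the vocabulary. The table size, number of seeds, and hashing scheme are illustrative assumptions, not spaCy's actual implementation.

```python
# Hedged sketch of a hash embedding layer: several independent hashes of a word
# select rows of one small table, and the rows are summed. Sizes and the hash
# construction are illustrative, not spaCy's actual code.
import hashlib

import numpy as np

TABLE_ROWS, DIM, NUM_SEEDS = 5000, 96, 4
rng = np.random.default_rng(0)
table = rng.normal(size=(TABLE_ROWS, DIM))   # trainable in a real model


def _bucket(word: str, seed: int) -> int:
    digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).hexdigest()
    return int(digest, 16) % TABLE_ROWS


def hash_embedding(word: str) -> np.ndarray:
    """Sum the table rows chosen by NUM_SEEDS independent hashes of the word."""
    return sum(table[_bucket(word, s)] for s in range(NUM_SEEDS))


print(hash_embedding("embedding").shape)  # (96,)
```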
- HashFormers: Towards Vocabulary-independent Pre-trained Transformers [30.699644290131044]
Transformer-based pre-trained language models are vocabulary-dependent, mapping by default each token to its corresponding embedding.
We propose HashFormers, a new family of vocabulary-independent pre-trained transformers.
arXiv Detail & Related papers (2022-10-14T15:39:34Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs with a strong auto-regressive decoder tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
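A hedged sketch of the anchor-plus-sparse-transform idea summarized in the Anchor & Transform entry above: each word's embedding is a sparse combination of a small set of anchor embeddings, so the dense vocabulary-by-dimension matrix is never materialized. The number of anchors, the sparsity level, and the storage format are illustrative assumptions.

```python
# Hedged sketch of anchor embeddings with a sparse per-word transform: a word's
# embedding is a sparse weighted sum of a few anchors. Sizes and sparsity are
# illustrative assumptions based only on the summary above.
import numpy as np

V, K, D, NNZ = 10_000, 500, 128, 5   # vocab, anchors, embedding dim, nonzeros per word
rng = np.random.default_rng(0)
anchors = rng.normal(size=(K, D))    # small anchor embedding set (learned in practice)

# Store only (anchor_index, weight) pairs per word instead of a dense (V, D) matrix.
sparse_rows = [
    (rng.choice(K, size=NNZ, replace=False), rng.normal(size=NNZ)) for _ in range(V)
]


def embed(word_id: int) -> np.ndarray:
    """Reconstruct one word's embedding from its sparse row of the transform."""
    idx, weights = sparse_rows[word_id]
    return weights @ anchors[idx]


print(embed(42).shape)  # (128,)
```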
This list is automatically generated from the titles and abstracts of the papers on this site.