Neural Machine Translation without Embeddings
- URL: http://arxiv.org/abs/2008.09396v2
- Date: Mon, 12 Apr 2021 13:33:25 GMT
- Title: Neural Machine Translation without Embeddings
- Authors: Uri Shaham and Omer Levy
- Abstract summary: Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and subword induction algorithms.
A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8.
Experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models.
- Score: 44.129310924201604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many NLP models operate over sequences of subword tokens produced by
hand-crafted tokenization rules and heuristic subword induction algorithms. A
simple universal alternative is to represent every computerized text as a
sequence of bytes via UTF-8, obviating the need for an embedding layer since
there are fewer token types (256) than dimensions. Surprisingly, replacing the
ubiquitous embedding layer with one-hot representations of each byte does not
hurt performance; experiments on byte-to-byte machine translation from English
to 10 different languages show a consistent improvement in BLEU, rivaling
character-level and even standard subword-level models. A deeper investigation
reveals that the combination of embeddingless models with decoder-input dropout
amounts to token dropout, which benefits byte-to-byte models in particular.
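For a concrete picture of the embeddingless setup described in the abstract, here is a minimal sketch: text is tokenized into UTF-8 bytes, and each byte is fed as a fixed one-hot vector zero-padded to the model dimension, so no embedding table is learned. The function names, model dimension, and dropout rate are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (PyTorch, assumed setup): UTF-8 bytes as tokens, fixed one-hot
# vectors in place of a learned embedding table. Names and sizes are illustrative.
import torch
import torch.nn.functional as F

VOCAB = 256      # all possible byte values
D_MODEL = 512    # model dimension; must be >= VOCAB for the one-hot trick

def bytes_to_ids(text: str) -> torch.Tensor:
    """Tokenize by UTF-8 encoding: every byte value is a token id in [0, 255]."""
    return torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

def one_hot_inputs(ids: torch.Tensor, p_drop: float = 0.0) -> torch.Tensor:
    """Replace the embedding layer with one-hot rows, zero-padded to D_MODEL.
    Dropout on a one-hot row can only hit its single nonzero entry, so input
    dropout here zeroes whole tokens -- i.e., it amounts to token dropout."""
    x = F.one_hot(ids, num_classes=VOCAB).float()     # (seq, 256)
    x = F.pad(x, (0, D_MODEL - VOCAB))                # (seq, D_MODEL)
    if p_drop > 0.0:
        x = F.dropout(x, p=p_drop, training=True)     # drops entire tokens
    return x

ids = bytes_to_ids("Übersetzung")       # non-ASCII characters become several bytes
src = one_hot_inputs(ids, p_drop=0.2)   # feed to a standard Transformer encoder
print(ids.shape, src.shape)
```

Because there are only 256 byte types and the model dimension already exceeds 256, every byte gets its own fixed axis in model space, which is why the input representation needs no learned parameters.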
Related papers
- Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process.
We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing.
We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z) - From Language Models over Tokens to Language Models over Characters [54.123846188068384]
Modern language models are internally -- and mathematically -- distributions over token strings rather than character strings.
This paper presents algorithms for converting token-level language models to character-level ones.
arXiv Detail & Related papers (2024-12-04T21:19:20Z) - Retrofitting Large Language Models with Dynamic Tokenization [3.608780819053423]
We propose retrofitting current language models with dynamic tokenization.
We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly.
We find that dynamic tokenization can mitigate the limitations of static tokenization by substantially improving inference speed and promoting fairness across languages.
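As a rough illustration of the batch-level merging idea, the toy sketch below performs one BPE-style merge over a batch and derives an embedding for the new token. The pair-frequency heuristic and the mean-pooling stand-in for the paper's embedding-prediction hypernetwork are assumptions, not the paper's method.

```python
# Toy sketch: one BPE-style merge over a batch of subword-id sequences.
# The mean-pool below is a stand-in for the paper's hypernetwork.
from collections import Counter

import torch

def most_frequent_pair(batch):
    """Count adjacent subword-id pairs across the whole batch."""
    counts = Counter()
    for seq in batch:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def hypernetwork_stub(pair, emb_table):
    """Stand-in for the embedding-prediction hypernetwork:
    simply average the embeddings of the merged subwords."""
    return emb_table[list(pair)].mean(dim=0)

batch = [[5, 7, 7, 3, 5, 7], [5, 7, 2]]
pair = most_frequent_pair(batch)                          # e.g. (5, 7)
merged = [merge_pair(seq, pair, new_id=100) for seq in batch]
emb = torch.randn(100, 64)                                # pretrained subword embeddings
print(pair, merged, hypernetwork_stub(pair, emb).shape)
```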
arXiv Detail & Related papers (2024-11-27T17:51:58Z) - Local Byte Fusion for Neural Machine Translation [19.16966721276286]
Subword tokenization schemes are the dominant technique used in current NLP models.
Byte-based methods, i.e., tokenization into byte sequences, are an alternative.
Experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional models.
arXiv Detail & Related papers (2022-05-23T17:49:02Z) - byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings [77.6701264226519]
We introduce byteSteady, a fast model for classification using byte-level n-gram embeddings.
A straightforward application of byteSteady is text classification.
We also apply byteSteady to one type of non-language data -- DNA sequences for gene classification.
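A fastText-style sketch of byte-level n-gram embeddings for classification follows; the hashing scheme, bucket count, n-gram sizes, and mean pooling are assumptions for illustration, not byteSteady's actual configuration.

```python
# Sketch (PyTorch): hashed byte n-gram embeddings averaged into one vector,
# then a linear classifier. Details are assumptions, not the byteSteady code.
import torch
import torch.nn as nn

class ByteNgramClassifier(nn.Module):
    def __init__(self, num_classes, buckets=2**18, dim=128, ns=(1, 2, 4)):
        super().__init__()
        self.ns = ns
        self.buckets = buckets
        self.emb = nn.EmbeddingBag(buckets, dim, mode="mean")
        self.out = nn.Linear(dim, num_classes)

    def ngram_ids(self, data: bytes) -> torch.Tensor:
        """Hash every byte n-gram into a fixed number of buckets.
        (Python's hash is process-salted; a real system would use a stable hash.)"""
        ids = []
        for n in self.ns:
            for i in range(len(data) - n + 1):
                ids.append(hash(data[i:i + n]) % self.buckets)
        return torch.tensor(ids, dtype=torch.long)

    def forward(self, texts):
        ids = [self.ngram_ids(t.encode("utf-8")) for t in texts]
        flat = torch.cat(ids)
        offsets = torch.tensor([0] + [len(x) for x in ids[:-1]]).cumsum(0)
        return self.out(self.emb(flat, offsets))

model = ByteNgramClassifier(num_classes=4)
logits = model(["hello world", "ACGTACGT"])   # works for text or DNA strings
print(logits.shape)                           # torch.Size([2, 4])
```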
arXiv Detail & Related papers (2021-06-24T20:14:48Z) - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
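A much-simplified sketch of the soft block-scoring idea: pool characters (or bytes) into blocks of several candidate sizes, score each candidate per position, and mix them with softmax weights so the "tokenization" stays differentiable. The real GBST module differs (for example, it also downsamples), and all names below are illustrative.

```python
# Simplified sketch (PyTorch) of gradient-based soft subword tokenization:
# candidate block poolings are mixed with per-position softmax weights.
# An illustration of the idea, not the paper's GBST module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftBlockMixer(nn.Module):
    def __init__(self, dim, block_sizes=(1, 2, 3, 4)):
        super().__init__()
        self.block_sizes = block_sizes
        self.score = nn.Linear(dim, 1)   # one scalar score per candidate block

    def forward(self, x):                # x: (batch, seq, dim) char/byte states
        b, s, d = x.shape
        cands, scores = [], []
        for size in self.block_sizes:
            pad = (size - s % size) % size
            xp = F.pad(x, (0, 0, 0, pad))                      # pad the seq dim
            blocks = xp.view(b, -1, size, d).mean(dim=2)       # pool each block
            up = blocks.repeat_interleave(size, dim=1)[:, :s]  # back to length s
            cands.append(up)
            scores.append(self.score(up))                      # (b, s, 1)
        weights = torch.softmax(torch.cat(scores, dim=-1), dim=-1)  # (b, s, K)
        mixed = sum(w.unsqueeze(-1) * c
                    for w, c in zip(weights.unbind(dim=-1), cands))
        return mixed                     # (b, s, dim), fed to the Transformer

x = torch.randn(2, 11, 64)               # e.g. byte embeddings or one-hot inputs
print(SoftBlockMixer(64)(x).shape)       # torch.Size([2, 11, 64])
```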
arXiv Detail & Related papers (2021-06-23T22:24:14Z) - ByT5: Towards a token-free future with pre-trained byte-to-byte models [23.532359202069063]
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
We show that a standard Transformer architecture can be used with minimal modifications to process byte sequences.
We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation.
arXiv Detail & Related papers (2021-05-28T07:03:22Z) - Word Shape Matters: Robust Machine Translation with Visual Embedding [78.96234298075389]
We introduce a new encoding of the input symbols for character-level NLP models.
It encodes the shape of each character through images of the letters as they appear when printed.
We name this new strategy visual embedding, and it is expected to improve the robustness of NLP models.
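A toy sketch of that idea: render each character with a bitmap font and flatten the pixels into a fixed-size vector, so visually similar symbols receive similar input vectors. The font, canvas size, and normalization below are assumptions for illustration, not the paper's setup.

```python
# Toy sketch: encode each character by the pixels of its rendered glyph.
# Font choice, canvas size, and normalization are illustrative assumptions.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def visual_embedding(ch: str, size=(16, 16)) -> np.ndarray:
    """Return a flattened grayscale bitmap of the printed character."""
    img = Image.new("L", size, color=0)            # black canvas
    draw = ImageDraw.Draw(img)
    draw.text((2, 2), ch, fill=255, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

a, o, zero = visual_embedding("a"), visual_embedding("o"), visual_embedding("0")
print(a.shape)                                              # (256,)
print(np.linalg.norm(o - zero), np.linalg.norm(a - zero))   # compare glyph distances
```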
arXiv Detail & Related papers (2020-10-20T04:08:03Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.