Neural Machine Translation without Embeddings
- URL: http://arxiv.org/abs/2008.09396v2
- Date: Mon, 12 Apr 2021 13:33:25 GMT
- Title: Neural Machine Translation without Embeddings
- Authors: Uri Shaham and Omer Levy
- Abstract summary: Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and subword induction algorithms.
A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8.
Experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models.
- Score: 44.129310924201604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many NLP models operate over sequences of subword tokens produced by
hand-crafted tokenization rules and heuristic subword induction algorithms. A
simple universal alternative is to represent every computerized text as a
sequence of bytes via UTF-8, obviating the need for an embedding layer since
there are fewer token types (256) than dimensions. Surprisingly, replacing the
ubiquitous embedding layer with one-hot representations of each byte does not
hurt performance; experiments on byte-to-byte machine translation from English
to 10 different languages show a consistent improvement in BLEU, rivaling
character-level and even standard subword-level models. A deeper investigation
reveals that the combination of embeddingless models with decoder-input dropout
amounts to token dropout, which benefits byte-to-byte models in particular.
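For a concrete picture of the embeddingless setup described in the abstract, here is a minimal sketch: text is tokenized into UTF-8 bytes, and each byte is fed as a fixed one-hot vector zero-padded to the model dimension, so no embedding table is learned. The function names, model dimension, and dropout rate are illustrative assumptions, not the authors' code.

```python
# Minimal sketch (PyTorch, assumed setup): UTF-8 bytes as tokens, fixed one-hot
# vectors in place of a learned embedding table. Names and sizes are illustrative.
import torch
import torch.nn.functional as F

VOCAB = 256      # all possible byte values
D_MODEL = 512    # model dimension; must be >= VOCAB for the one-hot trick

def bytes_to_ids(text: str) -> torch.Tensor:
    """Tokenize by UTF-8 encoding: every byte value is a token id in [0, 255]."""
    return torch.tensor(list(text.encode("utf-8")), dtype=torch.long)

def one_hot_inputs(ids: torch.Tensor, p_drop: float = 0.0) -> torch.Tensor:
    """Replace the embedding layer with one-hot rows, zero-padded to D_MODEL.
    Dropout on a one-hot row can only hit its single nonzero entry, so input
    dropout here zeroes whole tokens -- i.e., it amounts to token dropout."""
    x = F.one_hot(ids, num_classes=VOCAB).float()     # (seq, 256)
    x = F.pad(x, (0, D_MODEL - VOCAB))                # (seq, D_MODEL)
    if p_drop > 0.0:
        x = F.dropout(x, p=p_drop, training=True)     # drops entire tokens
    return x

ids = bytes_to_ids("Übersetzung")       # non-ASCII characters become several bytes
src = one_hot_inputs(ids, p_drop=0.2)   # feed to a standard Transformer encoder
print(ids.shape, src.shape)
```

Because there are only 256 byte types and the model dimension already exceeds 256, every byte gets its own fixed axis in model space, which is why the input representation needs no learned parameters.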
Related papers
- Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process.
We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing.
We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z) - From Language Models over Tokens to Language Models over Characters [54.123846188068384]
Modern language models are internally -- and mathematically -- distributions over token strings rather than character strings.
This paper presents algorithms for converting token-level language models to character-level ones.
arXiv Detail & Related papers (2024-12-04T21:19:20Z) - Retrofitting Large Language Models with Dynamic Tokenization [3.608780819053423]
We propose retrofitting current language models with dynamic tokenization.
We merge frequent subword sequences in a batch, then apply a pre-trained embedding-prediction hypernetwork to compute the token embeddings on-the-fly.
We find that dynamic tokenization can mitigate the limitations of static tokenization by substantially improving inference speed and promoting fairness across languages.
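As a rough illustration of the batch-level merging idea, the toy sketch below performs one BPE-style merge over a batch and derives an embedding for the new token. The pair-frequency heuristic and the mean-pooling stand-in for the paper's embedding-prediction hypernetwork are assumptions, not the paper's method.

```python
# Toy sketch: one BPE-style merge over a batch of subword-id sequences.
# The mean-pool below is a stand-in for the paper's hypernetwork.
from collections import Counter

import torch

def most_frequent_pair(batch):
    """Count adjacent subword-id pairs across the whole batch."""
    counts = Counter()
    for seq in batch:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def hypernetwork_stub(pair, emb_table):
    """Stand-in for the embedding-prediction hypernetwork:
    simply average the embeddings of the merged subwords."""
    return emb_table[list(pair)].mean(dim=0)

batch = [[5, 7, 7, 3, 5, 7], [5, 7, 2]]
pair = most_frequent_pair(batch)                          # e.g. (5, 7)
merged = [merge_pair(seq, pair, new_id=100) for seq in batch]
emb = torch.randn(100, 64)                                # pretrained subword embeddings
print(pair, merged, hypernetwork_stub(pair, emb).shape)
```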
arXiv Detail & Related papers (2024-11-27T17:51:58Z) - Local Byte Fusion for Neural Machine Translation [19.16966721276286]
Subword tokenization schemes are the dominant technique used in current NLP models.
Byte-based methods, i.e., tokenization into byte sequences, are an alternative.
Experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional models.
arXiv Detail & Related papers (2022-05-23T17:49:02Z) - byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings [77.6701264226519]
We introduce byteSteady, a fast model for classification using byte-level n-gram embeddings.
A straightforward application of byteSteady is text classification.
We also apply byteSteady to one type of non-language data -- DNA sequences for gene classification.
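A fastText-style sketch of byte-level n-gram embeddings for classification follows; the hashing scheme, bucket count, n-gram sizes, and mean pooling are assumptions for illustration, not byteSteady's actual configuration.

```python
# Sketch (PyTorch): hashed byte n-gram embeddings averaged into one vector,
# then a linear classifier. Details are assumptions, not the byteSteady code.
import torch
import torch.nn as nn

class ByteNgramClassifier(nn.Module):
    def __init__(self, num_classes, buckets=2**18, dim=128, ns=(1, 2, 4)):
        super().__init__()
        self.ns = ns
        self.buckets = buckets
        self.emb = nn.EmbeddingBag(buckets, dim, mode="mean")
        self.out = nn.Linear(dim, num_classes)

    def ngram_ids(self, data: bytes) -> torch.Tensor:
        """Hash every byte n-gram into a fixed number of buckets.
        (Python's hash is process-salted; a real system would use a stable hash.)"""
        ids = []
        for n in self.ns:
            for i in range(len(data) - n + 1):
                ids.append(hash(data[i:i + n]) % self.buckets)
        return torch.tensor(ids, dtype=torch.long)

    def forward(self, texts):
        ids = [self.ngram_ids(t.encode("utf-8")) for t in texts]
        flat = torch.cat(ids)
        offsets = torch.tensor([0] + [len(x) for x in ids[:-1]]).cumsum(0)
        return self.out(self.emb(flat, offsets))

model = ByteNgramClassifier(num_classes=4)
logits = model(["hello world", "ACGTACGT"])   # works for text or DNA strings
print(logits.shape)                           # torch.Size([2, 4])
```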
arXiv Detail & Related papers (2021-06-24T20:14:48Z) - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
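A much-simplified sketch of the soft block-scoring idea: pool characters (or bytes) into blocks of several candidate sizes, score each candidate per position, and mix them with softmax weights so the "tokenization" stays differentiable. The real GBST module differs (for example, it also downsamples), and all names below are illustrative.

```python
# Simplified sketch (PyTorch) of gradient-based soft subword tokenization:
# candidate block poolings are mixed with per-position softmax weights.
# An illustration of the idea, not the paper's GBST module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftBlockMixer(nn.Module):
    def __init__(self, dim, block_sizes=(1, 2, 3, 4)):
        super().__init__()
        self.block_sizes = block_sizes
        self.score = nn.Linear(dim, 1)   # one scalar score per candidate block

    def forward(self, x):                # x: (batch, seq, dim) char/byte states
        b, s, d = x.shape
        cands, scores = [], []
        for size in self.block_sizes:
            pad = (size - s % size) % size
            xp = F.pad(x, (0, 0, 0, pad))                      # pad the seq dim
            blocks = xp.view(b, -1, size, d).mean(dim=2)       # pool each block
            up = blocks.repeat_interleave(size, dim=1)[:, :s]  # back to length s
            cands.append(up)
            scores.append(self.score(up))                      # (b, s, 1)
        weights = torch.softmax(torch.cat(scores, dim=-1), dim=-1)  # (b, s, K)
        mixed = sum(w.unsqueeze(-1) * c
                    for w, c in zip(weights.unbind(dim=-1), cands))
        return mixed                     # (b, s, dim), fed to the Transformer

x = torch.randn(2, 11, 64)               # e.g. byte embeddings or one-hot inputs
print(SoftBlockMixer(64)(x).shape)       # torch.Size([2, 11, 64])
```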
arXiv Detail & Related papers (2021-06-23T22:24:14Z) - ByT5: Towards a token-free future with pre-trained byte-to-byte models [23.532359202069063]
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
We show that a standard Transformer architecture can be used with minimal modifications to process byte sequences.
We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation.
arXiv Detail & Related papers (2021-05-28T07:03:22Z) - Word Shape Matters: Robust Machine Translation with Visual Embedding [78.96234298075389]
We introduce a new encoding of the input symbols for character-level NLP models.
It encodes the shape of each character through images of the letters as they appear when printed.
We name this new strategy visual embedding, and it is expected to improve the robustness of NLP models.
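A toy sketch of that idea: render each character with a bitmap font and flatten the pixels into a fixed-size vector, so visually similar symbols receive similar input vectors. The font, canvas size, and normalization below are assumptions for illustration, not the paper's setup.

```python
# Toy sketch: encode each character by the pixels of its rendered glyph.
# Font choice, canvas size, and normalization are illustrative assumptions.
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def visual_embedding(ch: str, size=(16, 16)) -> np.ndarray:
    """Return a flattened grayscale bitmap of the printed character."""
    img = Image.new("L", size, color=0)            # black canvas
    draw = ImageDraw.Draw(img)
    draw.text((2, 2), ch, fill=255, font=ImageFont.load_default())
    return np.asarray(img, dtype=np.float32).flatten() / 255.0

a, o, zero = visual_embedding("a"), visual_embedding("o"), visual_embedding("0")
print(a.shape)                                              # (256,)
print(np.linalg.norm(o - zero), np.linalg.norm(a - zero))   # compare glyph distances
```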
arXiv Detail & Related papers (2020-10-20T04:08:03Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.