Local Byte Fusion for Neural Machine Translation
- URL: http://arxiv.org/abs/2205.11490v3
- Date: Wed, 28 Jun 2023 11:25:35 GMT
- Title: Local Byte Fusion for Neural Machine Translation
- Authors: Makesh Narsimhan Sreedhar, Xiangpeng Wan, Yu Cheng, Junjie Hu
- Abstract summary: Subword tokenization schemes are the dominant technique used in current NLP models.
Byte-based methods, i.e. tokenization into byte sequences, are an alternative.
Experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional byte-based models and even over subword techniques.
- Score: 19.16966721276286
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Subword tokenization schemes are the dominant technique used in current NLP
models. However, such schemes can be rigid and tokenizers built on one corpus
do not adapt well to other parallel corpora. It has also been observed that in
multilingual corpora, subword tokenization schemes over-segment low-resource
languages, leading to a drop in translation performance. A simple alternative to
subword tokenizers is byte-based methods, i.e. tokenization into byte sequences
using encoding schemes such as UTF-8. Byte tokens often represent inputs at a
sub-character granularity, i.e. one character can be represented by a sequence
of multiple byte tokens. This results in byte sequences that are significantly
longer than character sequences. Enforcing aggregation of local information in
the lower layers can guide the model to build higher-level semantic
information. We propose a Local Byte Fusion (LOBEF) method for byte-based
machine translation -- utilizing byte $n$-gram and word boundaries -- to
aggregate local semantic information. Extensive experiments on multilingual
translation, zero-shot cross-lingual transfer, and domain adaptation reveal a
consistent improvement over traditional byte-based models and even over subword
techniques. Further analysis also indicates that our byte-based models are
parameter-efficient and can be trained faster than subword models.
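To make the sub-character granularity concrete, the sketch below encodes a Devanagari word into UTF-8 bytes (each character expands to three byte tokens) and fuses neighbouring byte embeddings with a 1-D convolution over byte n-grams. The module, kernel size, and dimensions are illustrative assumptions, not LOBEF's exact architecture.
```python
# Minimal sketch: UTF-8 bytes as tokens, plus local n-gram fusion via a 1-D conv.
import torch
import torch.nn as nn

text = "नमस्ते"                          # 6 Unicode characters (Devanagari)
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))          # 6 characters -> 18 byte tokens

class ByteNGramFusion(nn.Module):
    """Fuse each byte embedding with its local n-gram neighbourhood.
    Kernel size and dimensions are illustrative, not LOBEF's exact setup."""
    def __init__(self, d_model=64, ngram=4, vocab=259):   # 256 bytes + pad/bos/eos
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.fuse = nn.Conv1d(d_model, d_model, kernel_size=ngram, padding=ngram - 1)

    def forward(self, ids):                      # ids: (batch, seq_len)
        x = self.embed(ids).transpose(1, 2)      # (batch, d_model, seq_len)
        h = self.fuse(x)[:, :, : ids.size(1)]    # trim conv padding back to seq_len
        return h.transpose(1, 2)                 # (batch, seq_len, d_model)

fused = ByteNGramFusion()(torch.tensor([byte_ids]))
print(fused.shape)                               # torch.Size([1, 18, 64])
```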
Related papers
- MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation [13.70446799743065]
Byte-based machine translation systems have shown significant potential in massively multilingual settings.
Unicode encoding, which maps each character to specific byte(s), eliminates the emergence of unknown words, even in new languages.
Local contextualization has proven effective in assigning initial semantics to tokens, improving sentence comprehension.
We propose Adaptive MultiScale-Headed Attention (Ada-MSHA), which adaptively selects and mixes attention heads treated as contextualization experts.
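A hedged sketch of the general idea is given below: self-attention whose heads are reweighted per token by a learned gate, with each head acting as a "contextualization expert". The module and shapes are assumptions and are not claimed to match Ada-MSHA's actual definition.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiHeadAttention(nn.Module):
    """Self-attention whose heads are reweighted per token by a learned gate,
    treating each head as a "contextualization expert". A generic sketch only;
    the actual Ada-MSHA module may be defined differently."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, n_heads)          # per-token scores over heads

    def forward(self, x):                                # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))                   # (batch, heads, seq, d_head)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                                 # per-head contextualization
        w = F.softmax(self.gate(x), dim=-1)              # (batch, seq, heads)
        heads = heads * w.permute(0, 2, 1).unsqueeze(-1) # gate (mix) the heads
        return self.out(heads.transpose(1, 2).reshape(b, s, -1))

x = torch.randn(2, 10, 64)
print(GatedMultiHeadAttention()(x).shape)                # torch.Size([2, 10, 64])
```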
arXiv Detail & Related papers (2024-11-03T08:15:43Z) - Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
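For reference, MaxMatch is the greedy longest-match-first procedure behind WordPiece tokenization; a minimal sketch over a toy vocabulary (the "##" continuation marker and the vocabulary itself are illustrative) looks like this:
```python
def maxmatch_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first (MaxMatch) tokenization, as used by WordPiece.
    Toy sketch: the vocabulary and "##" continuation marker are illustrative."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:                      # try the longest remaining span first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate    # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                       # no piece matches: whole word is unknown
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##afford", "##able", "##aff", "##ord"}
print(maxmatch_tokenize("unaffordable", vocab))   # ['un', '##afford', '##able']
```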
arXiv Detail & Related papers (2024-10-21T07:10:07Z) - MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [75.2540291039202]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose multilingual adaptive gradient-based subword tokenization (MAGNET) to reduce over-segmentation.
arXiv Detail & Related papers (2024-07-11T18:59:21Z) - Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation [14.826948179996695]
Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models.
We propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions.
Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios.
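One way to picture "varying scales across different hidden state dimensions" is to split the hidden state into groups and contextualize each group with a convolution of a different kernel size. The sketch below is only an assumption-laden illustration of that idea, not the MSC module itself; the group count and kernel sizes are arbitrary.
```python
import torch
import torch.nn as nn

class MultiScaleContextualizer(nn.Module):
    """Contextualize different slices of the hidden state at different scales.
    A hedged sketch of the idea only; MSC's actual design may differ."""
    def __init__(self, d_model=64, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        assert d_model % len(kernel_sizes) == 0
        self.d_group = d_model // len(kernel_sizes)
        self.convs = nn.ModuleList(
            [nn.Conv1d(self.d_group, self.d_group, k, padding=k // 2)
             for k in kernel_sizes]
        )

    def forward(self, x):                       # x: (batch, seq, d_model)
        groups = x.split(self.d_group, dim=-1)  # one slice of dimensions per scale
        out = [conv(g.transpose(1, 2)).transpose(1, 2)
               for conv, g in zip(self.convs, groups)]
        return torch.cat(out, dim=-1)

x = torch.randn(2, 20, 64)
print(MultiScaleContextualizer()(x).shape)      # torch.Size([2, 20, 64])
```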
arXiv Detail & Related papers (2024-05-29T17:19:04Z) - Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
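A simplified GBST-style block-mixing step might look like the sketch below: candidate blocks of several sizes are mean-pooled, scored, softly mixed per position, and the byte sequence is then downsampled into a shorter "latent subword" sequence. The block sizes, scoring head, and downsampling rate are assumptions for illustration, not Charformer's exact module.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGBST(nn.Module):
    """Simplified gradient-based subword tokenization (GBST)-style block mixing.
    Block sizes, scoring, and downsampling rate are illustrative assumptions."""
    def __init__(self, vocab=259, d_model=64, block_sizes=(1, 2, 3, 4), downsample=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.score = nn.Linear(d_model, 1)          # one score per candidate block
        self.block_sizes = block_sizes
        self.downsample = downsample

    def forward(self, ids):                         # ids: (batch, seq)
        x = self.embed(ids)                         # (batch, seq, d)
        b, s, d = x.shape
        candidates = []
        for size in self.block_sizes:
            pad = (-s) % size                       # pad so the length divides evenly
            xp = F.pad(x, (0, 0, 0, pad))
            blocks = xp.view(b, -1, size, d).mean(dim=2)        # mean-pool each block
            up = blocks.repeat_interleave(size, dim=1)[:, :s]   # back to length s
            candidates.append(up)
        cand = torch.stack(candidates, dim=2)       # (batch, seq, n_sizes, d)
        weights = F.softmax(self.score(cand).squeeze(-1), dim=-1)  # over block sizes
        mixed = (cand * weights.unsqueeze(-1)).sum(dim=2)           # (batch, seq, d)
        # downsample the byte sequence into a shorter latent "subword" sequence
        return F.avg_pool1d(mixed.transpose(1, 2), self.downsample).transpose(1, 2)

ids = torch.randint(0, 256, (2, 16))
print(SimpleGBST()(ids).shape)                      # torch.Size([2, 8, 64])
```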
arXiv Detail & Related papers (2021-06-23T22:24:14Z) - Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z) - Neural Machine Translation without Embeddings [44.129310924201604]
Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and subword induction algorithms.
A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8.
Experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models.
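With UTF-8 bytes the vocabulary shrinks to 256 symbols plus a handful of specials, which is what makes shrinking or dropping the embedding table feasible. The snippet below only illustrates the text-to-byte-id round trip; the PAD/BOS/EOS offsets are illustrative assumptions.
```python
# Text <-> UTF-8 byte ids round trip; the +3 offset for PAD/BOS/EOS is illustrative.
PAD, BOS, EOS, OFFSET = 0, 1, 2, 3

def text_to_ids(text):
    return [BOS] + [b + OFFSET for b in text.encode("utf-8")] + [EOS]

def ids_to_text(ids):
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="replace")   # invalid sequences degrade gracefully

src = "Übersetzung 翻訳"
ids = text_to_ids(src)
print(len(src), len(ids))          # 14 characters -> 21 ids (multi-byte characters expand)
print(ids_to_text(ids) == src)     # True
```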
arXiv Detail & Related papers (2020-08-21T09:54:11Z) - 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts (Simplified and Traditional Chinese).
Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z) - Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
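Both tokenizers can be trained side by side with the SentencePiece library; the sketch below assumes a local plain-text corpus.txt and an arbitrary vocabulary size, and is only a minimal comparison setup rather than the paper's experimental protocol.
```python
# Hedged sketch: train and compare unigram-LM and BPE tokenizers with SentencePiece.
# Assumes a plain-text file "corpus.txt" exists; the vocab size is arbitrary.
import sentencepiece as spm

for model_type in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
    )

uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")
bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
sentence = "Tokenization choices matter for pretraining."
print(uni.encode(sentence, out_type=str))   # unigram-LM segmentation
print(bpe.encode(sentence, out_type=str))   # BPE segmentation
```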
arXiv Detail & Related papers (2020-04-07T21:21:06Z)