Charformer: Fast Character Transformers via Gradient-based Subword
Tokenization
- URL: http://arxiv.org/abs/2106.12672v1
- Date: Wed, 23 Jun 2021 22:24:14 GMT
- Title: Charformer: Fast Character Transformers via Gradient-based Subword
Tokenization
- Authors: Yi Tay, Vinh Q. Tran, Sebastian Ruder, Jai Gupta, Hyung Won Chung,
Dara Bahri, Zhen Qin, Simon Baumgartner, Cong Yu, Donald Metzler
- Abstract summary: We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
- Score: 50.16128796194463
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art models in natural language processing rely on separate rigid
subword tokenization algorithms, which limit their generalization ability and
adaptation to new settings. In this paper, we propose a new model inductive
bias that learns a subword tokenization end-to-end as part of the model. To
this end, we introduce a soft gradient-based subword tokenization module (GBST)
that automatically learns latent subword representations from characters in a
data-driven fashion. Concretely, GBST enumerates candidate subword blocks and
learns to score them in a position-wise fashion using a block scoring network.
We additionally introduce Charformer, a deep Transformer model that integrates
GBST and operates on the byte level. Via extensive experiments on English GLUE,
multilingual, and noisy text datasets, we show that Charformer outperforms a
series of competitive byte-level baselines while generally performing on par
and sometimes outperforming subword-based models. Additionally, Charformer is
fast, improving the speed of both vanilla byte-level and subword-level
Transformers by 28%-100% while maintaining competitive quality. We believe this
work paves the way for highly performant token-free models that are trained
completely end-to-end.
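To make the GBST idea concrete, below is a minimal PyTorch sketch of the mechanism the abstract describes: enumerate candidate subword blocks of a few sizes, mean-pool each block, score the block sizes position-wise with a small scoring network, and mix the candidates by the softmax of those scores before downsampling. This is our own simplification under assumed names and shapes (GBSTSketch, block_sizes, downsample), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBSTSketch(nn.Module):
    """Hypothetical simplification of GBST, not the authors' code:
    enumerate candidate subword blocks of several sizes, pool each
    block, score block sizes per position, and take the softmax-
    weighted mixture of the pooled candidates."""

    def __init__(self, dim, block_sizes=(1, 2, 3, 4), downsample=2):
        super().__init__()
        self.block_sizes = block_sizes
        self.score = nn.Linear(dim, 1)  # block scoring network
        self.downsample = downsample

    def forward(self, x):  # x: (batch, length, dim) byte/char embeddings
        candidates = []
        for b in self.block_sizes:
            # mean-pool non-overlapping blocks of size b ...
            pooled = F.avg_pool1d(x.transpose(1, 2), kernel_size=b,
                                  stride=b, ceil_mode=True)
            # ... then broadcast each block back over its positions so
            # every position has one candidate per block size
            pooled = pooled.repeat_interleave(b, dim=2)[:, :, :x.size(1)]
            candidates.append(pooled.transpose(1, 2))
        blocks = torch.stack(candidates, dim=2)   # (batch, length, sizes, dim)
        weights = torch.softmax(self.score(blocks).squeeze(-1), dim=-1)
        mixed = (weights.unsqueeze(-1) * blocks).sum(dim=2)  # soft subword mix
        # downsample before the Transformer stack to cut sequence length
        out = F.avg_pool1d(mixed.transpose(1, 2), kernel_size=self.downsample,
                           stride=self.downsample)
        return out.transpose(1, 2)

x = torch.randn(2, 16, 64)        # 2 sequences of 16 byte embeddings
print(GBSTSketch(64)(x).shape)    # torch.Size([2, 8, 64])
```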
Related papers
- From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding [22.390804161191635]
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
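The summary above leaves the two levels unspecified; a hypothetical reading, with all names and shapes assumed rather than taken from the paper, is a character-level encoder that pools each word's characters into one vector, followed by a word-level Transformer:

```python
import torch
import torch.nn as nn

class TwoLevelEncoderSketch(nn.Module):
    """Hypothetical two-level reading of the idea: a character-level
    encoder pools each word's characters into one vector, and a
    word-level Transformer then runs over the word vectors."""

    def __init__(self, n_chars=256, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        char_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.char_enc = nn.TransformerEncoder(char_layer, num_layers=1)
        word_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.word_enc = nn.TransformerEncoder(word_layer, num_layers=1)

    def forward(self, chars):  # chars: (batch, n_words, chars_per_word) ids
        b, w, c = chars.shape
        h = self.char_emb(chars).view(b * w, c, -1)
        h = self.char_enc(h).mean(dim=1)     # pool chars -> one word vector
        words = h.view(b, w, -1)
        return self.word_enc(words)          # contextualize at word level

ids = torch.randint(0, 256, (2, 5, 8))       # 5 words of 8 characters each
print(TwoLevelEncoderSketch()(ids).shape)    # torch.Size([2, 5, 64])
```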
arXiv Detail & Related papers (2023-05-23T23:22:20Z)
- Word-Level Representation From Bytes For Language Modeling [46.28198397863388]
Sub-word tokenization is not robust to noise and is difficult to generalize to new languages.
We introduce a cross-attention network that builds word-level representation directly from bytes, and a sub-word level prediction based on word-level hidden states.
Byte2Word is on par with the strong sub-word baseline BERT while using only 10% of the embedding size.
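A hedged sketch of the byte-to-word step described above, assuming one learned query per word that cross-attends over that word's byte embeddings (ByteToWordSketch and its shapes are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class ByteToWordSketch(nn.Module):
    """Illustrative byte-to-word module: a learned query vector
    cross-attends over a word's byte embeddings to produce one
    word-level representation."""

    def __init__(self, dim=64, n_bytes=256):
        super().__init__()
        self.byte_emb = nn.Embedding(n_bytes, dim)
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, word_bytes):  # (batch, n_words, bytes_per_word) ids
        b, w, n = word_bytes.shape
        kv = self.byte_emb(word_bytes).view(b * w, n, -1)
        q = self.query.expand(b * w, -1, -1)
        word_vec, _ = self.attn(q, kv, kv)   # one query per word
        return word_vec.view(b, w, -1)       # (batch, n_words, dim)

bytes_in = torch.randint(0, 256, (2, 6, 12))
print(ByteToWordSketch()(bytes_in).shape)    # torch.Size([2, 6, 64])
```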
arXiv Detail & Related papers (2022-11-23T03:11:13Z)
- Local Byte Fusion for Neural Machine Translation [19.16966721276286]
Subword tokenization schemes are the dominant technique used in current NLP models.
Byte-based methods, i.e., tokenization into byte sequences, are an alternative.
Experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional models.
arXiv Detail & Related papers (2022-05-23T17:49:02Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
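A loose sketch of that recipe, with a toy frozen encoder standing in for the pretrained language model and only the bottleneck pooling and one-layer decoder trainable (all names and shapes assumed):

```python
import torch
import torch.nn as nn

class SentenceBottleneckSketch(nn.Module):
    """Loose sketch of the recipe: freeze a pretrained encoder, pool
    its outputs into a single sentence vector, and train only the
    bottleneck plus a one-layer decoder to reconstruct the input."""

    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        for p in self.encoder.parameters():   # frozen, as in the paper
            p.requires_grad = False
        self.pool = nn.Linear(dim, 1)         # trainable bottleneck pooling
        dec = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=1)
        self.out = nn.Linear(dim, vocab)

    def forward(self, ids):                    # ids: (batch, length)
        h = self.encoder(self.emb(ids))
        w = torch.softmax(self.pool(h), dim=1)  # attention-pool to one vector
        z = (w * h).sum(dim=1, keepdim=True)    # (batch, 1, dim) bottleneck
        dec = self.decoder(self.emb(ids), memory=z)
        return self.out(dec)                    # reconstruction logits

logits = SentenceBottleneckSketch()(torch.randint(0, 1000, (2, 10)))
print(logits.shape)                             # torch.Size([2, 10, 1000])
```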
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Learning to Look Inside: Augmenting Token-Based Encoders with Character-Level Information [29.633735942273997]
XRayEmb is a method for retrofitting existing token-based models with character-level information.
We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures.
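The summary does not spell out how the character-level vectors are fused; one plausible, purely illustrative variant encodes each token's characters with a small GRU and projects the concatenation with the pre-trained token embedding:

```python
import torch
import torch.nn as nn

class CharRetrofitSketch(nn.Module):
    """Illustrative take on the retrofitting idea: encode each token's
    characters with a small GRU and fuse the result with the existing
    pre-trained token embedding (names here are hypothetical)."""

    def __init__(self, dim=64, n_chars=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, token_embs, token_chars):
        # token_embs:  (batch, n_tokens, dim) pre-trained embeddings
        # token_chars: (batch, n_tokens, chars_per_token) char ids
        b, t, c = token_chars.shape
        _, h = self.gru(self.char_emb(token_chars).view(b * t, c, -1))
        char_vec = h[-1].view(b, t, -1)     # final GRU state per token
        return self.fuse(torch.cat([token_embs, char_vec], dim=-1))

embs = torch.randn(2, 7, 64)
chars = torch.randint(0, 128, (2, 7, 10))
print(CharRetrofitSketch()(embs, chars).shape)  # torch.Size([2, 7, 64])
```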
arXiv Detail & Related papers (2021-08-01T08:09:26Z)
- Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality [24.80654159288458]
We propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT.
Our module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation.
We show that incorporating our module to mBERT significantly improves the performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
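A rough sketch of the stated goal, assuming a distillation-style setup in which a character-level encoder is trained to reproduce rows of a pre-trained subword embedding table (the training target and all names are our assumption):

```python
import torch
import torch.nn as nn

class Char2SubwordSketch(nn.Module):
    """Rough sketch: a character-level network trained so that, given
    a subword's characters, it reproduces that subword's row in a
    pre-trained embedding table, making lookups robust to
    misspellings and casing (illustrative only)."""

    def __init__(self, dim=64, n_chars=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, chars):               # (batch, chars_per_subword)
        return self.enc(self.char_emb(chars)).mean(dim=1)

model = Char2SubwordSketch()
pretrained = nn.Embedding(1000, 64)         # stand-in for BERT's table
chars = torch.randint(0, 128, (8, 12))      # characters of 8 subwords
ids = torch.randint(0, 1000, (8,))          # their subword ids
loss = nn.functional.mse_loss(model(chars), pretrained(ids))
loss.backward()                             # learn to mimic the table
```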
arXiv Detail & Related papers (2020-10-24T01:08:28Z)
- Word Shape Matters: Robust Machine Translation with Visual Embedding [78.96234298075389]
We introduce a new encoding of the input symbols for character-level NLP models.
It encodes the shape of each character through images of the letters as they appear when printed.
We name this new strategy visual embedding; it is expected to improve the robustness of NLP models.
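A toy version of this visual-embedding idea, rendering each character to a small grayscale bitmap with Pillow's default font and using the flattened pixels as the character's vector (sizes and names assumed):

```python
import numpy as np
import torch
from PIL import Image, ImageDraw

def visual_embedding(text, size=16):
    """Toy visual embedding: render each character to a small
    grayscale bitmap and use the flattened pixels as its vector,
    so similarly shaped characters get similar embeddings."""
    vectors = []
    for ch in text:
        img = Image.new("L", (size, size), color=0)
        ImageDraw.Draw(img).text((2, 2), ch, fill=255)  # default bitmap font
        vectors.append(np.asarray(img, dtype=np.float32).ravel() / 255.0)
    return torch.tensor(np.stack(vectors))  # (len(text), size*size)

emb = visual_embedding("hello")
print(emb.shape)                             # torch.Size([5, 256])
```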
arXiv Detail & Related papers (2020-10-20T04:08:03Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed afterwards, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
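One way to realize such a layer, sketched under our own assumptions: build each candidate word's output embedding from its characters on the fly, so no parameter count depends on the training vocabulary:

```python
import torch
import torch.nn as nn

class CompositionalOutputSketch(nn.Module):
    """Hedged sketch of a compositional output layer: each candidate
    word's output embedding is built from its characters on the fly,
    so no parameter depends on the size of the training vocabulary."""

    def __init__(self, dim=64, n_chars=128):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)

    def word_vectors(self, word_chars):      # (n_words, chars_per_word)
        _, h = self.gru(self.char_emb(word_chars))
        return h[-1]                         # (n_words, dim)

    def forward(self, hidden, word_chars):
        # hidden: (batch, dim) LM states; logits over any word list
        return hidden @ self.word_vectors(word_chars).t()

layer = CompositionalOutputSketch()
vocab_chars = torch.randint(0, 128, (500, 10))  # spellings of 500 candidates
print(layer(torch.randn(2, 64), vocab_chars).shape)  # torch.Size([2, 500])
```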
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Towards Reasonably-Sized Character-Level Transformer NMT by Finetuning Subword Systems [78.80826533405019]
We show that we can obtain a neural machine translation model that works at the character level without requiring token segmentation.
Our study is a significant step towards high-performance and easy-to-train character-based models that are not extremely large.
arXiv Detail & Related papers (2020-04-29T15:56:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.