Learning to Look Inside: Augmenting Token-Based Encoders with
Character-Level Information
- URL: http://arxiv.org/abs/2108.00391v1
- Date: Sun, 1 Aug 2021 08:09:26 GMT
- Authors: Yuval Pinter, Amanda Stent, Mark Dredze, Jacob Eisenstein
- Abstract summary: XRayEmb is a method for retrofitting existing token-based models with character-level information.
We show that incorporating XRayEmb's learned vectors into sequences of pre-trained token embeddings helps performance on both autoregressive and masked pre-trained transformer architectures.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Commonly-used transformer language models depend on a tokenization schema
which sets an unchangeable subword vocabulary prior to pre-training, destined
to be applied to all downstream tasks regardless of domain shift, novel word
formations, or other sources of vocabulary mismatch. Recent work has shown that
"token-free" models can be trained directly on characters or bytes, but
training these models from scratch requires substantial computational
resources, and this implies discarding the many domain-specific models that
were trained on tokens. In this paper, we present XRayEmb, a method for
retrofitting existing token-based models with character-level information.
XRayEmb is composed of a character-level "encoder" that computes vector
representations of character sequences, and a generative component that decodes
from the internal representation to a character sequence. We show that
incorporating XRayEmb's learned vectors into sequences of pre-trained token
embeddings helps performance on both autoregressive and masked pre-trained
transformer architectures and on both sequence-level and sequence tagging
tasks, particularly on non-standard English text.
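The vocabulary mismatch described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary, the greedy tokenizer, the hash-seeded character vectors, and the rule "use a character-derived vector for any word outside the vocabulary" are all assumptions made for illustration. A trained character-level encoder would replace `encode_chars`.

```python
import random

DIM = 8  # toy embedding width (hypothetical)

# Toy fixed subword vocabulary (hypothetical, for illustration only).
VOCAB = {"look", "ing", "in", "side", "token", "s"}

def greedy_tokenize(word, vocab):
    """Longest-match-first segmentation, falling back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            # Accept the longest known piece; a single character always matches.
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

def char_vector(ch, dim=DIM):
    """Deterministic pseudo-random vector per character (stand-in for learned weights)."""
    rng = random.Random(ord(ch))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def encode_chars(word, dim=DIM):
    """Stand-in for a character-level encoder: mean of per-character vectors."""
    vecs = [char_vector(c, dim) for c in word]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def embed_sequence(words, vocab, token_table, dim=DIM):
    """One vector per word: the token embedding when the word is in the
    vocabulary, otherwise a single character-derived vector (assumed rule)."""
    return [token_table[w] if w in vocab else encode_chars(w, dim) for w in words]

# Hypothetical pre-trained token embeddings (all zeros here, just placeholders).
token_table = {t: [0.0] * DIM for t in VOCAB}

standard = greedy_tokenize("looking", VOCAB)    # ['look', 'ing']
noisy = greedy_tokenize("loooking", VOCAB)      # ['l', 'o', 'o', 'o', 'k', 'ing']
sequence = embed_sequence(["look", "loooking"], VOCAB, token_table)
```

In this sketch the non-standard spelling fragments into six pieces under the fixed vocabulary, while the character-derived path yields a single vector that sits alongside the ordinary token embedding in the sequence.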
Related papers
- Learning Mutually Informed Representations for Characters and Subwords [26.189422354038978]
We introduce the entanglement model, aiming to combine character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as separate modalities.
We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling.
arXiv Detail & Related papers (2023-11-14T02:09:10Z)
- From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding [22.390804161191635]
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
arXiv Detail & Related papers (2023-05-23T23:22:20Z)
- What do tokens know about their characters and how do they know it? [3.8254443661593633]
We show that pre-trained language models that use subword tokenization schemes can succeed at a variety of language tasks that require character-level information.
We show that these models robustly encode character-level information and, in general, larger models perform better at the task.
arXiv Detail & Related papers (2022-06-06T13:27:26Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- ByT5: Towards a token-free future with pre-trained byte-to-byte models [23.532359202069063]
Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units.
We show that a standard Transformer architecture can be used with minimal modifications to process byte sequences.
We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation.
arXiv Detail & Related papers (2021-05-28T07:03:22Z)
- COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining [59.169836983883656]
COCO-LM is a new self-supervised learning framework that pretrains Language Models by COrrecting challenging errors and COntrasting text sequences.
COCO-LM employs an auxiliary language model to mask-and-predict tokens in original text sequences.
Our analyses reveal that COCO-LM's advantages come from its challenging training signals, more contextualized token representations, and regularized sequence representations.
arXiv Detail & Related papers (2021-02-16T22:24:29Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training sequence encoders.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
- 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts (Simplified and Traditional Chinese).
Our proposed method outperforms previous Chinese Character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z)
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)