From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding
- URL: http://arxiv.org/abs/2305.14571v2
- Date: Tue, 30 May 2023 03:36:13 GMT
- Title: From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding
- Authors: Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei Florencio, Cha
Zhang
- Abstract summary: Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
- Score: 22.390804161191635
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current state-of-the-art models for natural language understanding require a
preprocessing step to convert raw text into discrete tokens. This process,
known as tokenization, relies on a pre-built vocabulary of words or sub-word
morphemes. This fixed vocabulary limits the model's robustness to spelling
errors and its capacity to adapt to new domains. In this work, we introduce a
novel open-vocabulary language model that adopts a hierarchical two-level
approach: one at the word level and another at the sequence level. Concretely,
we design an intra-word module that uses a shallow Transformer architecture to
learn word representations from their characters, and a deep inter-word
Transformer module that contextualizes each word representation by attending to
the entire word sequence. Our model thus directly operates on character
sequences with explicit awareness of word boundaries, but without a biased
sub-word or word-level vocabulary. Experiments on various downstream tasks show
that our method outperforms strong baselines. We also demonstrate that our
hierarchical model is robust to textual corruption and domain shift.
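The abstract describes the architecture only in prose. Below is a minimal PyTorch sketch of the general idea: a shallow intra-word Transformer over characters followed by a deep inter-word Transformer over pooled word vectors. All names, layer counts, dimensions, and pooling choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a hierarchical character-to-word encoder, assuming:
# - word boundaries are given (here: whitespace splitting),
# - a byte-level character vocabulary of 256 symbols with 0 reused as padding,
# - mean pooling over characters to form each word vector.
# These choices are illustrative only and are not taken from the paper.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, char_vocab=256, d_model=256,
                 intra_layers=2, inter_layers=12, n_heads=4):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, d_model)
        intra = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        inter = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Shallow intra-word module: contextualizes the characters of one word.
        self.intra_word = nn.TransformerEncoder(intra, num_layers=intra_layers)
        # Deep inter-word module: contextualizes word vectors across the sequence.
        self.inter_word = nn.TransformerEncoder(inter, num_layers=inter_layers)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len) character ids of one sentence.
        char_states = self.intra_word(self.char_emb(char_ids))
        word_vecs = char_states.mean(dim=1)             # pool characters -> word
        return self.inter_word(word_vecs.unsqueeze(0))  # (1, num_words, d_model)

# Usage: encode "hello world" as raw bytes, padded to the longest word.
words = ["hello", "world"]
max_len = max(len(w) for w in words)
ids = torch.tensor([list(w.encode()) + [0] * (max_len - len(w)) for w in words])
print(HierarchicalEncoder()(ids).shape)  # torch.Size([1, 2, 256])
```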
Related papers
- Unsupervised Morphological Tree Tokenizer [36.584680344291556]
We introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words.
Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named Overriding to ensure the indecomposability of morphemes.
Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner.
arXiv Detail & Related papers (2024-06-21T15:35:49Z) - Inducing Character-level Structure in Subword-based Language Models with
Type-level Interchange Intervention Training [36.19870483966741]
We develop a causal intervention framework to learn robust and interpretable character representations inside subword-based language models.
Our method treats each character as a typed variable in a causal model and learns such causal structures.
We additionally introduce a suite of character-level tasks that systematically vary in their dependence on meaning and sequence-level context.
arXiv Detail & Related papers (2022-12-19T22:37:46Z) - Word-Level Representation From Bytes For Language Modeling [46.28198397863388]
Sub-word tokenization is not robust to noise and is difficult to generalize to new languages.
We introduce a cross-attention network that builds word-level representations directly from bytes, and a sub-word-level prediction based on word-level hidden states.
Byte2Word is on par with the strong sub-word baseline BERT while taking up only 10% of the embedding size.
arXiv Detail & Related papers (2022-11-23T03:11:13Z) - Charformer: Fast Character Transformers via Gradient-based Subword
Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embeddings with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - SLM: Learning a Discourse Language Representation with Sentence
Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this pre-training objective improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z) - Unsupervised Distillation of Syntactic Information from Contextualized
Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Morphological Skip-Gram: Using morphological knowledge to improve word
representation [2.0129974477913457]
We propose a new method for training word embeddings by replacing the FastText bag of character n-grams with a bag of word morphemes.
The results show a competitive performance compared to FastText.
arXiv Detail & Related papers (2020-07-20T12:47:36Z)
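As a concrete illustration of that last idea (composing a word vector from a bag of morphemes rather than character n-grams), here is a small sketch. The morpheme table, dimensions, and the omission of skip-gram training are simplifying assumptions, not details from the paper.

```python
# Toy sketch: FastText-style composition with morphemes instead of character
# n-grams. The morpheme table is a hypothetical placeholder; a real system would
# use a morphological analyzer and learn the vectors with skip-gram training.
import numpy as np

DIM = 50
rng = np.random.default_rng(0)

# Hypothetical morpheme segmentations (illustrative only).
MORPHEMES = {"unhappiness": ["un", "happi", "ness"],
             "walked": ["walk", "ed"]}

# One vector per morpheme plus one per whole word, mirroring FastText's scheme.
units = {m for ms in MORPHEMES.values() for m in ms} | set(MORPHEMES)
emb = {u: rng.normal(size=DIM) for u in units}

def word_vector(word):
    """Word vector = whole-word vector + sum of its morpheme vectors."""
    parts = [word] + MORPHEMES.get(word, [])
    return np.sum([emb[p] for p in parts], axis=0)

print(word_vector("unhappiness").shape)  # (50,)
```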