From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding
- URL: http://arxiv.org/abs/2305.14571v2
- Date: Tue, 30 May 2023 03:36:13 GMT
- Title: From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding
- Authors: Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei Florencio, Cha
Zhang
- Abstract summary: Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
- Score: 22.390804161191635
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current state-of-the-art models for natural language understanding require a
preprocessing step to convert raw text into discrete tokens. This process,
known as tokenization, relies on a pre-built vocabulary of words or sub-word
morphemes. This fixed vocabulary limits the model's robustness to spelling
errors and its capacity to adapt to new domains. In this work, we introduce a
novel open-vocabulary language model that adopts a hierarchical two-level
approach: one at the word level and another at the sequence level. Concretely,
we design an intra-word module that uses a shallow Transformer architecture to
learn word representations from their characters, and a deep inter-word
Transformer module that contextualizes each word representation by attending to
the entire word sequence. Our model thus directly operates on character
sequences with explicit awareness of word boundaries, but without a biased
sub-word or word-level vocabulary. Experiments on various downstream tasks show
that our method outperforms strong baselines. We also demonstrate that our
hierarchical model is robust to textual corruption and domain shift.
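The abstract describes the architecture only in prose. Below is a minimal PyTorch sketch of the general idea: a shallow intra-word Transformer over characters followed by a deep inter-word Transformer over pooled word vectors. All names, layer counts, dimensions, and pooling choices are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a hierarchical character-to-word encoder, assuming:
# - word boundaries are given (here: whitespace splitting),
# - a byte-level character vocabulary of 256 symbols with 0 reused as padding,
# - mean pooling over characters to form each word vector.
# These choices are illustrative only and are not taken from the paper.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, char_vocab=256, d_model=256,
                 intra_layers=2, inter_layers=12, n_heads=4):
        super().__init__()
        self.char_emb = nn.Embedding(char_vocab, d_model)
        intra = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        inter = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Shallow intra-word module: contextualizes the characters of one word.
        self.intra_word = nn.TransformerEncoder(intra, num_layers=intra_layers)
        # Deep inter-word module: contextualizes word vectors across the sequence.
        self.inter_word = nn.TransformerEncoder(inter, num_layers=inter_layers)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len) character ids of one sentence.
        char_states = self.intra_word(self.char_emb(char_ids))
        word_vecs = char_states.mean(dim=1)             # pool characters -> word
        return self.inter_word(word_vecs.unsqueeze(0))  # (1, num_words, d_model)

# Usage: encode "hello world" as raw bytes, padded to the longest word.
words = ["hello", "world"]
max_len = max(len(w) for w in words)
ids = torch.tensor([list(w.encode()) + [0] * (max_len - len(w)) for w in words])
print(HierarchicalEncoder()(ids).shape)  # torch.Size([1, 2, 256])
```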
Related papers
- Unsupervised Morphological Tree Tokenizer [36.584680344291556]
We introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words.
Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named Overriding to ensure the indecomposability of morphemes.
Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner.
arXiv Detail & Related papers (2024-06-21T15:35:49Z) - Inducing Character-level Structure in Subword-based Language Models with
Type-level Interchange Intervention Training [36.19870483966741]
We develop a causal intervention framework to learn robust and interpretable character representations inside subword-based language models.
Our method treats each character as a typed variable in a causal model and learns such causal structures.
We additionally introduce a suite of character-level tasks that systematically vary in their dependence on meaning and sequence-level context.
arXiv Detail & Related papers (2022-12-19T22:37:46Z) - Word-Level Representation From Bytes For Language Modeling [46.28198397863388]
Sub-word tokenization is not robust to noise and is difficult to generalize to new languages.
We introduce a cross-attention network that builds word-level representations directly from bytes, and a sub-word-level prediction based on word-level hidden states.
Byte2Word is on par with the strong sub-word baseline BERT while taking up only 10% of the embedding size.
arXiv Detail & Related papers (2022-11-23T03:11:13Z) - Charformer: Fast Character Transformers via Gradient-based Subword
Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embeddings with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - SLM: Learning a Discourse Language Representation with Sentence
Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation.
We show that this pre-training objective improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z) - Unsupervised Distillation of Syntactic Information from Contextualized
Word Representations [62.230491683411536]
We tackle the task of unsupervised disentanglement between semantics and structure in neural language representations.
To this end, we automatically generate groups of sentences which are structurally similar but semantically different.
We demonstrate that our transformation clusters vectors in space by structural properties, rather than by lexical semantics.
arXiv Detail & Related papers (2020-10-11T15:13:18Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z) - Morphological Skip-Gram: Using morphological knowledge to improve word
representation [2.0129974477913457]
We propose a new method for training word embeddings by replacing the FastText bag of character n-grams with a bag of word morphemes.
The results show a competitive performance compared to FastText.
arXiv Detail & Related papers (2020-07-20T12:47:36Z)
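As a concrete illustration of that last idea (composing a word vector from a bag of morphemes rather than character n-grams), here is a small sketch. The morpheme table, dimensions, and the omission of skip-gram training are simplifying assumptions, not details from the paper.

```python
# Toy sketch: FastText-style composition with morphemes instead of character
# n-grams. The morpheme table is a hypothetical placeholder; a real system would
# use a morphological analyzer and learn the vectors with skip-gram training.
import numpy as np

DIM = 50
rng = np.random.default_rng(0)

# Hypothetical morpheme segmentations (illustrative only).
MORPHEMES = {"unhappiness": ["un", "happi", "ness"],
             "walked": ["walk", "ed"]}

# One vector per morpheme plus one per whole word, mirroring FastText's scheme.
units = {m for ms in MORPHEMES.values() for m in ms} | set(MORPHEMES)
emb = {u: rng.normal(size=DIM) for u in units}

def word_vector(word):
    """Word vector = whole-word vector + sum of its morpheme vectors."""
    parts = [word] + MORPHEMES.get(word, [])
    return np.sum([emb[p] for p in parts], axis=0)

print(word_vector("unhappiness").shape)  # (50,)
```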