Related papers: Word-Level Representation From Bytes For Language Modeling

URL: http://arxiv.org/abs/2211.12677v1
Date: Wed, 23 Nov 2022 03:11:13 GMT
Title: Word-Level Representation From Bytes For Language Modeling
Authors: Chu-Tak Lee, Qipeng Guo, Xipeng Qiu
Abstract summary: Sub-word tokenization is not robust to noise and is difficult to generalize to new languages. We introduce a cross-attention network that builds word-level representations directly from bytes, and a sub-word-level prediction based on word-level hidden states. Byte2Word is on par with the strong sub-word baseline BERT but takes up only 10% of the embedding size.
Score: 46.28198397863388
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern language models mostly take sub-words as input, a design that balances the trade-off between vocabulary size, number of parameters, and performance. However, sub-word tokenization still has disadvantages, such as a lack of robustness to noise and difficulty generalizing to new languages. Also, the current trend of scaling up models reveals that larger models require larger embeddings, which makes parallelization hard. Previous work on image classification proves that splitting raw input into a sequence of chunks is a strong, model-agnostic inductive bias. Based on this observation, we rethink the existing character-aware method that takes character-level inputs but performs word-level sequence modeling and prediction. We overhaul this method by introducing a cross-attention network that builds word-level representations directly from bytes, and a sub-word-level prediction based on word-level hidden states to avoid the time and space requirements of word-level prediction. With these two improvements combined, we have a token-free model with slim input embeddings for downstream tasks. We name our method Byte2Word and perform evaluations on language modeling and text classification. Experiments show that Byte2Word is on par with the strong sub-word baseline BERT but takes up only 10% of the embedding size. We further test our method on synthetic noise and cross-lingual transfer and find it competitive with baseline methods in both settings.

Related papers

From Bytes to Ideas: Language Modeling with Autoregressive U-Nets [49.16552366851748]
Tokenization imposes a fixed granularity on the input text. We introduce an autoregressive U-Net that learns to embed its own tokens as it trains.
arXiv Detail & Related papers (2025-06-17T17:55:11Z)
Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings. An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts). This makes them suitable for addressing text classification problems in domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding [22.390804161191635]
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes. We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
arXiv Detail & Related papers (2023-05-23T23:22:20Z)
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents. We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP [22.772546707304766]
We show how hybrid approaches combining words and characters, as well as subword-based approaches based on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a single silver-bullet solution for all applications.
arXiv Detail & Related papers (2021-12-20T13:04:18Z)
Dict-BERT: Enhancing Language Model Pre-training with Dictionary [42.0998323292348]
Pre-trained language models (PLMs) aim to learn universal language representations by conducting self-supervised training tasks on large-scale corpora. In this work, we focus on enhancing language model pre-training by leveraging definitions of rare words in dictionaries. We propose two novel self-supervised pre-training tasks on word and sentence-level alignment between input text sequence and rare word definitions.
arXiv Detail & Related papers (2021-10-13T04:29:14Z)
Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences. We introduce denoising word alignment as a new cross-lingual pre-training task. Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size. We propose a fully compositional output embedding layer for language models. To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.
