Breaking Character: Are Subwords Good Enough for MRLs After All?
- URL: http://arxiv.org/abs/2204.04748v1
- Date: Sun, 10 Apr 2022 18:54:43 GMT
- Title: Breaking Character: Are Subwords Good Enough for MRLs After All?
- Authors: Omri Keren, Tal Avinari, Reut Tsarfaty, Omer Levy
- Abstract summary: We pretraining a BERT-style language model over character sequences instead of word-pieces.
We compare the resulting model, dubbed TavBERT, against contemporary PLMs based on subwords for three highly complex and ambiguous MRLs.
Our results show, for all tested languages, that while TavBERT obtains mild improvements on surface-level tasks, subword-based PLMs achieve significantly higher performance on semantic tasks.
- Score: 36.11778282905458
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large pretrained language models (PLMs) typically tokenize the input string
into contiguous subwords before any pretraining or inference. However, previous
studies have claimed that this form of subword tokenization is inadequate for
processing morphologically-rich languages (MRLs). We revisit this hypothesis by
pretraining a BERT-style masked language model over character sequences instead
of word-pieces. We compare the resulting model, dubbed TavBERT, against
contemporary PLMs based on subwords for three highly complex and ambiguous MRLs
(Hebrew, Turkish, and Arabic), testing them on both morphological and semantic
tasks. Our results show, for all tested languages, that while TavBERT obtains
mild improvements on surface-level tasks à la POS tagging and full
morphological disambiguation, subword-based PLMs achieve significantly higher
performance on semantic tasks, such as named entity recognition and extractive
question answering. These results showcase and (re)confirm the potential of
subword tokenization as a reasonable modeling assumption for many languages,
including MRLs.
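To make the contrast between the two tokenization regimes concrete, here is a minimal sketch of character-level span masking in the spirit of TavBERT. The mask symbol, masking rate, and span lengths below are illustrative assumptions, not the paper's actual hyperparameters.

```python
# Hedged sketch: BERT-style masking over character sequences instead of word-pieces.
# Masking rate, span lengths, and the mask symbol are assumptions for illustration.
import random

MASK = "\u2588"  # stand-in mask character, not the symbol used by TavBERT

def char_tokenize(text: str) -> list[str]:
    """Every character (including spaces) is its own token -- no subword vocabulary needed."""
    return list(text)

def mask_char_spans(tokens: list[str], mask_rate: float = 0.15, max_span: int = 5):
    """Mask contiguous character spans; labels keep the original characters for the MLM loss."""
    tokens = list(tokens)             # work on a copy
    labels = [None] * len(tokens)     # None = position not predicted
    budget = int(len(tokens) * mask_rate)
    while budget > 0:
        span = random.randint(1, max_span)
        start = random.randrange(0, max(1, len(tokens) - span))
        for i in range(start, min(start + span, len(tokens))):
            if labels[i] is None and budget > 0:
                labels[i] = tokens[i]
                tokens[i] = MASK
                budget -= 1
    return tokens, labels

# Turkish example: "evlerimizden" ("from our houses") packs several morphemes into one word;
# character-level inputs avoid committing to any single subword segmentation of it.
inp, lab = mask_char_spans(char_tokenize("evlerimizden geliyoruz"))
print("".join(inp))
```

Because every character is its own token, no fixed subword vocabulary is needed; the experiments above ask whether giving up that vocabulary actually pays off on MRLs, and find that for semantic tasks it does not.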
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step toward measuring the role of shared semantics among subwords in encoder-only multilingual language models (mLMs).
We form "semantic tokens" by merging the semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z)
- A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, freeing the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z)
- Learning Mutually Informed Representations for Characters and Subwords [26.189422354038978]
We introduce the entanglement model, aiming to combine character and subword language models.
Inspired by vision-language models, our model treats characters and subwords as separate modalities.
We evaluate our model on text classification, named entity recognition, POS-tagging, and character-level sequence labeling.
arXiv Detail & Related papers (2023-11-14T02:09:10Z)
- CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
- Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
arXiv Detail & Related papers (2023-02-03T09:54:27Z)
- Subword Segmental Language Modelling for Nguni Languages [7.252933737829635]
The subword segmental language model (SSLM) learns how to segment words while being trained for autoregressive language modelling.
We train our model on the 4 Nguni languages of South Africa.
Our results show that learning subword segmentation is an effective alternative to existing subword segmenters.
arXiv Detail & Related papers (2022-10-12T18:41:00Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters (a sketch of this idea appears after this list).
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- Superbizarre Is Not Superb: Improving BERT's Interpretations of Complex Words with Derivational Morphology [13.535770763481905]
We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords.
Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.
arXiv Detail & Related papers (2021-01-02T08:36:48Z)
- CharBERT: Character-aware Pre-trained Language Model [36.9333890698306]
We propose a character-aware pre-trained language model named CharBERT.
We first construct the contextual word embedding for each token from the sequential character representations.
We then fuse the representations of characters and the subword representations by a novel heterogeneous interaction module.
arXiv Detail & Related papers (2020-11-03T07:13:06Z)
- Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality [24.80654159288458]
We propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT.
Our module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation.
We show that incorporating our module to mBERT significantly improves the performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
arXiv Detail & Related papers (2020-10-24T01:08:28Z)
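The Charformer entry above describes GBST, which replaces a hard tokenizer with a differentiable mixture over character blocks. Below is a hedged NumPy sketch of that idea under simplifying assumptions: a single random vector stands in for the learned block scorer, and the downsampling step that follows the soft mixture in the published module is omitted.

```python
# Hedged sketch of gradient-based subword tokenization (GBST): pool character blocks of
# several sizes, score them, and mix them softly so segmentation stays differentiable.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gbst_sketch(char_embs, block_sizes=(1, 2, 3, 4), w_score=None):
    """char_embs: (seq_len, dim) character embeddings -> (seq_len, dim) soft block mixture."""
    seq_len, dim = char_embs.shape
    if w_score is None:
        w_score = np.random.default_rng(0).normal(size=dim)  # stand-in for a learned scorer
    candidates, scores = [], []
    for b in block_sizes:
        n_blocks = -(-seq_len // b)                             # ceil(seq_len / b)
        padded = np.pad(char_embs, ((0, n_blocks * b - seq_len), (0, 0)))
        pooled = padded.reshape(n_blocks, b, dim).mean(axis=1)  # mean-pool non-overlapping blocks
        upsampled = np.repeat(pooled, b, axis=0)[:seq_len]      # copy block vectors back to characters
        candidates.append(upsampled)
        scores.append(upsampled @ w_score)                      # one score per position per block size
    cand = np.stack(candidates, axis=1)                         # (seq_len, n_sizes, dim)
    probs = softmax(np.stack(scores, axis=1))                   # (seq_len, n_sizes)
    return (probs[..., None] * cand).sum(axis=1)                # soft "latent subword" per character

mixed = gbst_sketch(np.random.default_rng(1).normal(size=(12, 8)))
print(mixed.shape)  # (12, 8)
```

Because the block weights come from a softmax rather than a hard segmentation decision, gradients flow through the tokenization step, which is what allows it to be learned end-to-end as the entry describes.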