Subword Pooling Makes a Difference
- URL: http://arxiv.org/abs/2102.10864v1
- Date: Mon, 22 Feb 2021 09:59:30 GMT
- Title: Subword Pooling Makes a Difference
- Authors: Judit Ács and Ákos Kádár and András Kornai
- Abstract summary: We investigate how the choice of subword pooling affects the downstream performance on three tasks.
For morphological tasks, the widely used 'choose the first subword' is the worst strategy and the best results are obtained with attention over the subwords.
For POS tagging both of these strategies perform poorly and the best choice is to use a small LSTM over the subwords.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual word-representations became a standard in modern natural language
processing systems. These models use subword tokenization to handle large
vocabularies and unknown words. Word-level usage of such systems requires a way
of pooling multiple subwords that correspond to a single word. In this paper we
investigate how the choice of subword pooling affects the downstream
performance on three tasks: morphological probing, POS tagging and NER, in 9
typologically diverse languages. We compare these in two massively multilingual
models, mBERT and XLM-RoBERTa. For morphological tasks, the widely used 'choose
the first subword' is the worst strategy and the best results are obtained by
using attention over the subwords. For POS tagging both of these strategies
perform poorly and the best choice is to use a small LSTM over the subwords.
The same strategy works best for NER and we show that mBERT is better than
XLM-RoBERTa in all 9 languages. We publicly release all code, data and the full
result tables at https://github.com/juditacs/subword-choice.
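The three pooling strategies named in the abstract can be summarized in code. Below is a minimal PyTorch sketch, not the authors' released implementation (that code is linked above); the module names, tensor shapes and the hidden_dim/lstm_dim parameters are illustrative assumptions.

```python
# Minimal sketch of three subword-pooling strategies: 'first subword',
# attention over subwords, and a small LSTM over subwords.
# Shapes and hyperparameters are illustrative, not the authors' exact setup.
import torch
import torch.nn as nn


class FirstPooling(nn.Module):
    """Keep only the first subword's vector for each word."""
    def forward(self, subword_vecs):           # (num_subwords, hidden_dim)
        return subword_vecs[0]                 # (hidden_dim,)


class AttentionPooling(nn.Module):
    """Weighted sum of subword vectors with learned attention scores."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, subword_vecs):           # (num_subwords, hidden_dim)
        weights = torch.softmax(self.scorer(subword_vecs), dim=0)
        return (weights * subword_vecs).sum(dim=0)   # (hidden_dim,)


class LSTMPooling(nn.Module):
    """Run a small BiLSTM over the subwords and concatenate its final states."""
    def __init__(self, hidden_dim, lstm_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, lstm_dim,
                            bidirectional=True, batch_first=True)

    def forward(self, subword_vecs):           # (num_subwords, hidden_dim)
        _, (h_n, _) = self.lstm(subword_vecs.unsqueeze(0))
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)   # (2 * lstm_dim,)


# Example: pool the vectors of a word split into three subwords.
vecs = torch.randn(3, 768)
word_vec = AttentionPooling(768)(vecs)         # shape: (768,)
```

Each module maps the subword vectors of one word to a single word vector; per the abstract, attention pooling wins on morphological probing while the small LSTM works best for POS tagging and NER.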
Related papers
- Tomato, Tomahto, Tomate: Measuring the Role of Shared Semantics among Subwords in Multilingual Language Models [88.07940818022468]
We take an initial step on measuring the role of shared semantics among subwords in the encoder-only multilingual language models (mLMs)
We form "semantic tokens" by merging the semantically similar subwords and their embeddings.
Inspections of the grouped subwords show that they exhibit a wide range of semantic similarities.
arXiv Detail & Related papers (2024-11-07T08:38:32Z) - CompoundPiece: Evaluating and Improving Decompounding Performance of
Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents.
We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary.
We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z) - Impact of Subword Pooling Strategy on Cross-lingual Event Detection [2.3361634876233817]
A pooling strategy takes the subword representations as input and outputs a single representation for the entire word (a sketch of the word-to-subword alignment this requires appears after this list).
We show that the choice of pooling strategy can have a significant impact on the target language performance.
We carry out our analysis with five different pooling strategies across nine languages in diverse multi-lingual datasets.
arXiv Detail & Related papers (2023-02-22T13:33:21Z) - Always Keep your Target in Mind: Studying Semantics and Improving
Performance of Neural Lexical Substitution [124.99894592871385]
We present a large-scale comparative study of lexical substitution methods employing both old and most recent language models.
We show that already competitive results achieved by SOTA LMs/MLMs can be further substantially improved if information about the target word is injected properly.
arXiv Detail & Related papers (2022-06-07T16:16:19Z) - Breaking Character: Are Subwords Good Enough for MRLs After All? [36.11778282905458]
We pretrain a BERT-style language model over character sequences instead of word-pieces.
We compare the resulting model, dubbed TavBERT, against contemporary PLMs based on subwords for three highly complex and ambiguous MRLs.
Our results show, for all tested languages, that while TavBERT obtains mild improvements on surface-level tasks, subword-based PLMs achieve significantly higher performance on semantic tasks.
arXiv Detail & Related papers (2022-04-10T18:54:43Z) - Pretraining without Wordpieces: Learning Over a Vocabulary of Millions
of Words [50.11559460111882]
We explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze tests and machine reading comprehension.
Since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets.
arXiv Detail & Related papers (2022-02-24T15:15:48Z) - Subword Mapping and Anchoring across Languages [1.9352552677009318]
Subword Mapping and Anchoring across Languages (SMALA) is a method to construct bilingual subword vocabularies.
SMALA extracts subword alignments using an unsupervised state-of-the-art mapping technique.
We show that joint subword vocabularies obtained with SMALA lead to higher BLEU scores on sentences that contain many false positives and false negatives.
arXiv Detail & Related papers (2021-09-09T20:46:27Z) - Multilingual Autoregressive Entity Linking [49.35994386221958]
mGENRE is a sequence-to-sequence system for the Multilingual Entity Linking problem.
For a mention in a given language, mGENRE predicts the name of the target entity left-to-right, token-by-token.
We show the efficacy of our approach through extensive evaluation including experiments on three popular MEL benchmarks.
arXiv Detail & Related papers (2021-03-23T13:25:55Z) - DBTagger: Multi-Task Learning for Keyword Mapping in NLIDBs Using
Bi-Directional Recurrent Neural Networks [0.2578242050187029]
We propose a novel deep learning based supervised approach that utilizes POS tags of NLQs.
We evaluate our approach on eight different datasets, and report new state-of-the-art accuracy results of 92.4% on average.
arXiv Detail & Related papers (2021-01-11T22:54:39Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Char2Subword: Extending the Subword Embedding Space Using Robust
Character Compositionality [24.80654159288458]
We propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT.
Our module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation.
We show that incorporating our module to mBERT significantly improves the performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
arXiv Detail & Related papers (2020-10-24T01:08:28Z)
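Every pooling strategy discussed above, including the one in the cross-lingual event detection paper, presupposes a mapping from each word to the positions of its subword tokens. The sketch below shows one way to obtain that mapping with a Hugging Face fast tokenizer; the model name and the grouping code are illustrative assumptions, not any paper's published pipeline.

```python
# Sketch: map each word to the indices of its subword tokens using a
# Hugging Face fast tokenizer, so a pooling strategy (first / attention / LSTM)
# can then be applied per word. The model choice is an illustrative assumption.
from collections import defaultdict
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

words = ["Subword", "pooling", "makes", "a", "difference"]
encoding = tokenizer(words, is_split_into_words=True)

# word_ids() returns, for every token, the index of the word it came from
# (None for special tokens such as [CLS] and [SEP]).
word_to_subword_positions = defaultdict(list)
for token_pos, word_id in enumerate(encoding.word_ids()):
    if word_id is not None:
        word_to_subword_positions[word_id].append(token_pos)

for word_id, positions in word_to_subword_positions.items():
    subwords = [encoding.tokens()[p] for p in positions]
    print(words[word_id], "->", subwords)
```

The resulting per-word index lists are exactly what the pooling modules sketched earlier consume: the hidden states at those positions are gathered and reduced to one vector per word.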