CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary
Representations From Characters
- URL: http://arxiv.org/abs/2010.10392v3
- Date: Sat, 31 Oct 2020 21:29:04 GMT
- Title: CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary
Representations From Characters
- Authors: Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Hiroshi Noji,
Pierre Zweigenbaum, Junichi Tsujii
- Abstract summary: We propose a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters.
We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level and open-vocabulary representations.
- Score: 14.956626084281638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the compelling improvements brought by BERT, many recent
representation models adopted the Transformer architecture as their main
building block, consequently inheriting the wordpiece tokenization system
despite it not being intrinsically linked to the notion of Transformers. While
this system is thought to achieve a good balance between the flexibility of
characters and the efficiency of full words, using predefined wordpiece
vocabularies from the general domain is not always suitable, especially when
building models for specialized domains (e.g., the medical domain). Moreover,
adopting a wordpiece tokenization shifts the focus from the word level to the
subword level, making the models conceptually more complex and arguably less
convenient in practice. For these reasons, we propose CharacterBERT, a new
variant of BERT that drops the wordpiece system altogether and uses a
Character-CNN module instead to represent entire words by consulting their
characters. We show that this new model improves the performance of BERT on a
variety of medical domain tasks while at the same time producing robust,
word-level and open-vocabulary representations.
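To make the Character-CNN idea concrete, below is a minimal PyTorch sketch of a word-level character encoder in the spirit of ELMo's character module. The character-vocabulary size, filter widths and counts, highway depth, and output dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CharacterCNNEmbedder(nn.Module):
    """Minimal sketch of an ELMo-style Character-CNN that produces one vector per
    word from its character ids, standing in for BERT's wordpiece embedding lookup.
    Hyperparameters are illustrative, not the paper's exact configuration."""

    def __init__(self, n_chars=262, char_dim=16,
                 filters=((1, 32), (2, 32), (3, 64), (4, 128), (5, 256)),
                 hidden_size=768):
        super().__init__()
        self.char_embeddings = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, n_filters, kernel_size=width) for width, n_filters in filters]
        )
        conv_dim = sum(n for _, n in filters)
        self.highway = nn.Linear(conv_dim, 2 * conv_dim)     # one highway layer (gate + transform)
        self.projection = nn.Linear(conv_dim, hidden_size)   # map to the Transformer hidden size

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, n_words, n_chars_per_word) integer character ids for each word,
        # padded to a fixed number of characters per word (at least the widest filter).
        batch, n_words, n_chars = char_ids.shape
        x = self.char_embeddings(char_ids.reshape(batch * n_words, n_chars)).transpose(1, 2)
        # convolve over characters and max-pool over time, one feature map per filter width
        pooled = torch.cat([conv(x).max(dim=-1).values for conv in self.convs], dim=-1)
        gate, transform = self.highway(pooled).chunk(2, dim=-1)
        gate = torch.sigmoid(gate)
        hidden = gate * torch.relu(transform) + (1 - gate) * pooled
        # one open-vocabulary embedding per whole word, fed to the Transformer layers
        return self.projection(hidden).reshape(batch, n_words, -1)
```

In CharacterBERT, word vectors produced this way take the place of the wordpiece embedding lookup, while the rest of the BERT architecture (position and segment embeddings, Transformer layers) is left unchanged, which is what keeps the model word-level and open-vocabulary.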
Related papers
- From Characters to Words: Hierarchical Pre-trained Language Model for
Open-vocabulary Language Understanding [22.390804161191635]
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens.
This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes.
We introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach.
arXiv Detail & Related papers (2023-05-23T23:22:20Z)
- Trading Syntax Trees for Wordpieces: Target-oriented Opinion Words Extraction with Wordpieces and Aspect Enhancement [33.66973706499751]
State-of-the-art target-oriented opinion word extraction (TOWE) models typically use BERT-based text encoders that operate on the word level.
These methods achieve limited gains with graph convolutional networks (GCNs) and have difficulty using BERT wordpieces.
This work trades syntax trees for BERT wordpieces by entirely removing the GCN component from the methods' architectures.
arXiv Detail & Related papers (2023-05-18T15:22:00Z)
- Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words [50.11559460111882]
We explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT achieves significant improvements on cloze tests and machine reading comprehension.
Since the pipeline is language-independent, we also train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets.
arXiv Detail & Related papers (2022-02-24T15:15:48Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level (see the simplified sketch after this list).
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
- Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z)
- Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality [24.80654159288458]
We propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT.
Our module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation.
We show that incorporating our module into mBERT significantly improves performance on the social media linguistic code-switching evaluation (LinCE) benchmark.
arXiv Detail & Related papers (2020-10-24T01:08:28Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
- Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve [49.878051587667244]
We examine the performance of several variants of LSTM-CRF architectures for named entity recognition.
We find that context representations do contribute to system performance, but that the main factor driving high performance is learning the name tokens themselves.
We enlist human annotators to evaluate the feasibility of inferring entity types from the context alone and find that, for the majority of the errors made by the context-only system, people are not able to infer the entity type either, though there is some room for improvement.
arXiv Detail & Related papers (2020-04-09T14:37:12Z)
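For the Charformer entry above, the following is a simplified, illustrative sketch of the gradient-based soft subword idea: candidate block representations at several block sizes are scored and mixed per character position, then the sequence is downsampled. The block sizes, scoring layer, and pooling-based downsampling are assumptions, and several refinements of the full GBST method are omitted; this is not the official Charformer code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftSubwordBlocks(nn.Module):
    """Illustrative GBST-style module: soft, learned mixing over candidate
    subword blocks of different sizes, computed from character/byte embeddings."""

    def __init__(self, d_model: int, block_sizes=(1, 2, 3, 4), downsample_rate: int = 2):
        super().__init__()
        self.block_sizes = block_sizes
        self.downsample_rate = downsample_rate
        self.scorer = nn.Linear(d_model, 1)  # scores each candidate block representation

    def forward(self, char_embeddings: torch.Tensor) -> torch.Tensor:
        # char_embeddings: (batch, seq_len, d_model) character/byte embeddings
        batch, seq_len, d_model = char_embeddings.shape
        candidates = []
        for b in self.block_sizes:
            pad = (b - seq_len % b) % b
            x = F.pad(char_embeddings, (0, 0, 0, pad))            # pad seq_len to a multiple of b
            blocks = x.reshape(batch, -1, b, d_model).mean(dim=2)  # mean-pool non-overlapping blocks
            candidates.append(blocks.repeat_interleave(b, dim=1)[:, :seq_len])  # broadcast back to positions
        cand = torch.stack(candidates, dim=2)                      # (batch, seq_len, n_block_sizes, d_model)
        weights = F.softmax(self.scorer(cand).squeeze(-1), dim=-1)  # soft choice over block sizes per position
        latent = (weights.unsqueeze(-1) * cand).sum(dim=2)          # weighted sum = latent subword representation
        # shorten the sequence before the Transformer stack
        return F.avg_pool1d(latent.transpose(1, 2), self.downsample_rate).transpose(1, 2)
```

A byte-level model in this spirit would feed raw byte embeddings into such a module and pass the shortened latent sequence to an ordinary Transformer encoder.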
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.