An Empirical Study of Pre-trained Transformers for Arabic Information
Extraction
- URL: http://arxiv.org/abs/2004.14519v5
- Date: Sat, 7 Nov 2020 14:40:25 GMT
- Title: An Empirical Study of Pre-trained Transformers for Arabic Information
Extraction
- Authors: Wuwei Lan, Yang Chen, Wei Xu and Alan Ritter
- Abstract summary: We pre-train a customized bilingual BERT, dubbed GigaBERT, specifically for Arabic NLP and English-to-Arabic zero-shot transfer learning.
We study GigaBERT's effectiveness on zero-shot transfer across four IE tasks: named entity recognition, part-of-speech tagging, argument role labeling, and relation extraction.
Our best model significantly outperforms mBERT, XLM-RoBERTa, and AraBERT in both the supervised and zero-shot transfer settings.
- Score: 25.10651348642055
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multilingual pre-trained Transformers, such as mBERT (Devlin et al., 2019)
and XLM-RoBERTa (Conneau et al., 2020a), have been shown to enable effective
cross-lingual zero-shot transfer. However, their performance on
Arabic information extraction (IE) tasks is not very well studied. In this
paper, we pre-train a customized bilingual BERT, dubbed GigaBERT, that is
designed specifically for Arabic NLP and English-to-Arabic zero-shot transfer
learning. We study GigaBERT's effectiveness on zero-shot transfer across four
IE tasks: named entity recognition, part-of-speech tagging, argument role
labeling, and relation extraction. Our best model significantly outperforms
mBERT, XLM-RoBERTa, and AraBERT (Antoun et al., 2020) in both the supervised
and zero-shot transfer settings. We have made our pre-trained models publicly
available at https://github.com/lanwuwei/GigaBERT.
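For readers who want to experiment with the released checkpoints, a minimal sketch of loading GigaBERT for token-level tagging (e.g., NER) with Hugging Face Transformers follows. The model id and the 9-label tag set are illustrative assumptions, not details stated in the abstract; the checkpoints themselves are linked from https://github.com/lanwuwei/GigaBERT.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumptions: the model id and the CoNLL-style 9-label tag set below are
# illustrative only; the released checkpoints are linked from
# https://github.com/lanwuwei/GigaBERT.
MODEL_ID = "lanwuwei/GigaBERT-v4-Arabic-and-English"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=9)

# In the zero-shot setting, the tagging head would be fine-tuned on English
# NER data only and then applied directly to Arabic text such as this:
sentence = "ولد باراك أوباما في هاواي"  # "Barack Obama was born in Hawaii"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, num_subwords, num_labels)
predicted_tag_ids = logits.argmax(dim=-1).squeeze(0).tolist()
```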
Related papers
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- NarrowBERT: Accelerating Masked Language Model Pretraining and Inference [50.59811343945605]
We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than $2\times$.
NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining.
We show that NarrowBERT increases the throughput at inference time by as much as $3.5\times$ with minimal (or no) performance degradation on sentence encoding tasks like MNLI.
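As a rough illustration of the sparsification described above, the sketch below computes single-head attention in which only the masked positions act as queries (keys and values still cover the whole sentence), so outputs are produced only where the MLM loss needs them. This is a simplified stand-in, not NarrowBERT's actual architecture or layer placement.

```python
import torch
import torch.nn.functional as F

def narrow_attention(hidden, masked_idx, w_q, w_k, w_v):
    """Single-head attention with queries restricted to masked positions.

    hidden:        (seq_len, dim) token states for one sentence
    masked_idx:    (num_masked,) indices of the [MASK] positions
    w_q, w_k, w_v: (dim, dim) projection matrices
    Returns (num_masked, dim): outputs only for the masked tokens, which is
    all the masked-language-model head needs during pretraining.
    """
    q = hidden[masked_idx] @ w_q                 # (num_masked, dim)
    k = hidden @ w_k                             # (seq_len, dim)
    v = hidden @ w_v                             # (seq_len, dim)
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)    # (num_masked, seq_len)
    return F.softmax(scores, dim=-1) @ v         # (num_masked, dim)

# Example: a 10-token sentence with two masked positions (toy dimensions).
hidden = torch.randn(10, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = narrow_attention(hidden, torch.tensor([2, 7]), w_q, w_k, w_v)
```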
arXiv Detail & Related papers (2023-01-11T23:45:50Z)
- RobBERT-2022: Updating a Dutch Language Model to Account for Evolving Language Use [9.797319790710711]
We update RobBERT, a state-of-the-art Dutch language model, which was trained in 2019.
First, the tokenizer of RobBERT is updated to include new high-frequency tokens present in the latest Dutch OSCAR corpus.
To evaluate whether our new model is a plug-in replacement for RobBERT, we introduce two additional criteria based on concept drift of existing tokens and alignment for novel tokens.
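As a minimal sketch of the vocabulary-extension step, the snippet below adds new tokens to an existing tokenizer and resizes the embedding matrix with Hugging Face Transformers. The checkpoint id and the example Dutch tokens are assumptions; the authors' actual procedure for selecting high-frequency tokens from OSCAR is more involved.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint id for the original RobBERT; the example tokens are
# hypothetical high-frequency Dutch words, not the authors' actual list.
CHECKPOINT = "pdelobelle/robbert-v2-dutch-base"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForMaskedLM.from_pretrained(CHECKPOINT)

new_tokens = ["coronamaatregelen", "mondkapje"]
num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
# Continued pretraining on the newer corpus would then update both the new
# embeddings and the rest of the model.
```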
arXiv Detail & Related papers (2022-11-15T14:55:53Z)
- fBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over $1.4$ million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
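The instance selection amounts to filtering SOLID tweets by their aggregated offensiveness scores before retraining; a minimal sketch under assumed field names and an assumed cutoff (not the paper's exact values):

```python
# Minimal sketch of score-threshold selection. SOLID assigns each tweet an
# aggregated, semi-supervised offensiveness confidence; the field names and
# the 0.8 cutoff here are assumptions.
def select_offensive_instances(rows, threshold=0.8):
    """Keep instances whose aggregated offensiveness score is >= threshold."""
    return [row for row in rows if row["avg_score"] >= threshold]

solid_sample = [
    {"text": "example tweet 1", "avg_score": 0.93},
    {"text": "example tweet 2", "avg_score": 0.41},
]
retraining_data = select_offensive_instances(solid_sample, threshold=0.8)
print(len(retraining_data))  # 1
```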
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Pre-Training BERT on Arabic Tweets: Practical Considerations [11.087099497830552]
We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing.
All are intended to support Arabic dialects and social media.
The new models achieve state-of-the-art results on several downstream tasks.
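As an illustration of the kind of linguistic preprocessing Arabic models commonly apply (not necessarily the exact steps used in this paper), a minimal normalization sketch:

```python
import re

# Common Arabic normalization steps (illustrative; the paper's own
# preprocessing choices may differ): strip diacritics and tatweel,
# unify alef variants, and map alef maksura to yaa.
DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan ... sukun

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)                         # remove short vowels
    text = text.replace("\u0640", "")                       # remove tatweel
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)  # آ / أ / إ -> ا
    text = text.replace("\u0649", "\u064A")                 # ى -> ي
    return text

print(normalize_arabic("إلَى الأَمام"))  # diacritics stripped, alef variants unified
```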
arXiv Detail & Related papers (2021-02-21T20:51:33Z)
- GottBERT: a pure German Language Model [0.0]
No single-language German RoBERTa model has been published yet; we introduce one in this work (GottBERT).
We compare its performance against existing single-language German BERT models and two multilingual ones on two Named Entity Recognition (NER) tasks, CoNLL 2003 and GermEval 2014, and on the text classification tasks GermEval 2018 (fine and coarse) and GNAD.
GottBERT was successfully pre-trained on a 256-core TPU pod using the RoBERTa BASE architecture.
arXiv Detail & Related papers (2020-12-03T17:45:03Z)
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR).
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLM-R-large model.
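As a simplified illustration of what a sentence-level alignment objective over parallel data can look like (a stand-in for the general idea, not AMBER's exact formulation):

```python
import torch
import torch.nn.functional as F

def sentence_alignment_loss(src_repr, tgt_repr, temperature=0.05):
    """Contrastive sentence-alignment objective over a batch of parallel
    sentence pairs: each source sentence should score highest against its
    own translation. Illustrative only; AMBER's objectives and granularities
    (including word-level alignment) differ.

    src_repr, tgt_repr: (batch, dim) pooled encoder outputs.
    """
    src = F.normalize(src_repr, dim=-1)
    tgt = F.normalize(tgt_repr, dim=-1)
    logits = src @ tgt.T / temperature                   # (batch, batch) similarities
    labels = torch.arange(src.size(0), device=src.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random "pooled representations" of 8 parallel sentence pairs.
src = torch.randn(8, 768)
tgt = torch.randn(8, 768)
loss = sentence_alignment_loss(src, tgt)
```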
arXiv Detail & Related papers (2020-10-15T18:34:13Z)
- Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models [0.0]
We introduce the strategies used by the Accenture Team for the CLEF 2020 CheckThat! Lab, Task 1, on English and Arabic.
This shared task evaluated whether a claim in social media text should be professionally fact-checked.
We utilized BERT and RoBERTa models to identify claims in social media text that a professional fact-checker should review.
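A minimal sketch of this kind of setup, framing check-worthiness as binary sequence classification with a pre-trained encoder; the base checkpoint, label meaning, and example tweet are placeholders rather than the team's exact configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder setup: after fine-tuning on the shared-task training data,
# label 1 would mean "a professional fact-checker should review this claim".
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

tweet = "The city will cut the water supply for 48 hours starting tomorrow."
inputs = tokenizer(tweet, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(f"check-worthy probability: {probs[0, 1].item():.3f}")
```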
arXiv Detail & Related papers (2020-09-05T01:44:11Z)
- Revisiting Pre-Trained Models for Chinese Natural Language Processing [73.65780892128389]
We revisit Chinese pre-trained language models to examine their effectiveness in a non-English language.
We also propose a model called MacBERT, which improves upon RoBERTa in several ways.
arXiv Detail & Related papers (2020-04-29T02:08:30Z)
- Extending Multilingual BERT to Low-Resource Languages [71.0976635999159]
Multilingual BERT (M-BERT) has been a huge success in both supervised and zero-shot cross-lingual transfer learning.
We propose a simple but effective approach to extend M-BERT so that it can benefit any new language.
arXiv Detail & Related papers (2020-04-28T16:36:41Z)
- AraBERT: Transformer-based Model for Arabic Language Understanding [0.0]
We pre-trained BERT specifically for the Arabic language in the pursuit of achieving the same success that BERT did for the English language.
The results showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks.
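A minimal usage sketch with Hugging Face Transformers; the checkpoint id is an assumption (the AraBERT models are released separately) and is not given in the abstract above:

```python
from transformers import pipeline

# Assumed checkpoint id; outputs are purely illustrative.
fill = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv02")

masked = "عاصمة فرنسا هي [MASK]."  # "The capital of France is [MASK]."
for prediction in fill(masked)[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```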
arXiv Detail & Related papers (2020-02-28T22:59:24Z)