LT@Helsinki at SemEval-2020 Task 12: Multilingual or language-specific BERT?
- URL: http://arxiv.org/abs/2008.00805v1
- Date: Mon, 3 Aug 2020 12:03:17 GMT
- Title: LT@Helsinki at SemEval-2020 Task 12: Multilingual or language-specific BERT?
- Authors: Marc Pàmies, Emily Öhman, Kaisla Kajava, Jörg Tiedemann
- Abstract summary: This paper presents the different models submitted by the LT@Helsinki team for the SemEval 2020 Shared Task 12.
Our team participated in sub-tasks A and C; titled offensive language identification and offense target identification, respectively.
In both cases we used the so-called Bidirectional Encoder Representation from Transformer (BERT), a model pre-trained by Google and fine-tuned by us on the OLID and SOLID datasets.
- Score: 0.42056926734482064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the different models submitted by the LT@Helsinki team
for the SemEval 2020 Shared Task 12. Our team participated in sub-tasks A and
C; titled offensive language identification and offense target identification,
respectively. In both cases we used the so-called Bidirectional Encoder
Representation from Transformer (BERT), a model pre-trained by Google and
fine-tuned by us on the OLID and SOLID datasets. The results show that
offensive tweet classification is one of several language-based tasks where
BERT can achieve state-of-the-art results.
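As a rough, hypothetical sketch of the setup described above (not the authors' released code), the snippet below fine-tunes a BERT sequence classifier on binary offensive/not-offensive labels with the Hugging Face transformers library; the model name, toy examples, and hyperparameters are placeholders, and loading of the actual OLID/SOLID data is omitted.

```python
# Minimal sketch: fine-tune a BERT classifier for offensive vs. non-offensive tweets.
# Placeholder data stands in for OLID/SOLID; hyperparameters are illustrative only.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # assumption; the paper compares multilingual and language-specific BERTs

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Toy stand-ins for OLID/SOLID tweets and their NOT/OFF labels.
texts = ["have a great day", "you are an idiot"]
labels = torch.tensor([0, 1])  # 0 = NOT offensive, 1 = OFF offensive

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few epochs is typical for BERT fine-tuning
    optimizer.zero_grad()
    outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
    outputs.loss.backward()
    optimizer.step()

# Inference: predict whether a new tweet is offensive.
model.eval()
with torch.no_grad():
    logits = model(**tokenizer(["some new tweet"], return_tensors="pt")).logits
print("predicted label:", logits.argmax(dim=-1).item())
```

Sub-task C (offense target identification) would follow the same pattern, with num_labels set to the number of target classes.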
Related papers
- Transfer-Free Data-Efficient Multilingual Slot Labeling [82.02076369811402]
Slot labeling is a core component of task-oriented dialogue (ToD) systems.
To mitigate the inherent data scarcity issue, current research on multilingual ToD assumes that sufficient English-language annotated data are always available.
We propose a two-stage slot labeling approach (termed TWOSL) which transforms standard multilingual sentence encoders into effective slot labelers.
arXiv Detail & Related papers (2023-05-22T22:47:32Z) - Bridging Cross-Lingual Gaps During Leveraging the Multilingual
Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z) - MCL@IITK at SemEval-2021 Task 2: Multilingual and Cross-lingual
Word-in-Context Disambiguation using Augmented Data, Signals, and
Transformers [1.869621561196521]
We present our approach for solving the SemEval 2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC)
The goal is to detect whether a given word common to both the sentences evokes the same meaning.
We submit systems for both the settings - Multilingual and Cross-Lingual.
arXiv Detail & Related papers (2021-04-04T08:49:28Z) - Bertinho: Galician BERT Representations [14.341471404165349]
This paper presents a monolingual BERT model for Galician.
We release two models, built using 6 and 12 transformer layers, respectively.
We show that our models, especially the 12-layer one, outperform the results of mBERT in most tasks.
arXiv Detail & Related papers (2021-03-25T12:51:34Z) - Explicit Alignment Objectives for Multilingual Bidirectional Encoders [111.65322283420805]
We present a new method for learning multilingual encoders, AMBER (Aligned Multilingual Bi-directional EncodeR)
AMBER is trained on additional parallel data using two explicit alignment objectives that align the multilingual representations at different granularities.
Experimental results show that AMBER obtains gains of up to 1.1 average F1 score on sequence tagging and up to 27.3 average accuracy on retrieval over the XLMR-large model.
arXiv Detail & Related papers (2020-10-15T18:34:13Z) - Galileo at SemEval-2020 Task 12: Multi-lingual Learning for Offensive
Language Identification using Pre-trained Language Models [11.868582973877626]
This paper describes Galileo's performance in SemEval-2020 Task 12 on detecting and categorizing offensive language in social media.
For Offensive Language Identification, we proposed a multi-lingual method using Pre-trained Language Models, ERNIE and XLM-R.
For offensive language categorization, we proposed a knowledge distillation method trained on soft labels generated by several supervised models.
arXiv Detail & Related papers (2020-10-07T17:40:19Z) - ANDES at SemEval-2020 Task 12: A jointly-trained BERT multilingual model
- ANDES at SemEval-2020 Task 12: A jointly-trained BERT multilingual model for offensive language detection [0.6445605125467572]
We jointly-trained a single model by fine-tuning Multilingual BERT to tackle the task across all the proposed languages.
Our single model had competitive results, with a performance close to top-performing systems.
arXiv Detail & Related papers (2020-08-13T16:07:00Z) - LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for
Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z) - CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z) - LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for
- LIIR at SemEval-2020 Task 12: A Cross-Lingual Augmentation Approach for Multilingual Offensive Language Identification [19.23116755449024]
We adapt and fine-tune the BERT and Multilingual BERT models made available by Google AI for English and non-English languages respectively.
For the English language, we use a combination of two fine-tuned BERT models.
For other languages we propose a cross-lingual augmentation approach in order to enrich training data and we use Multilingual BERT to obtain sentence representations.
arXiv Detail & Related papers (2020-05-07T18:45:48Z) - XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training,
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.