Domain Adaptive Pretraining for Multilingual Acronym Extraction
- URL: http://arxiv.org/abs/2206.15221v1
- Date: Thu, 30 Jun 2022 12:11:39 GMT
- Title: Domain Adaptive Pretraining for Multilingual Acronym Extraction
- Authors: Usama Yaseen and Stefan Langer
- Abstract summary: This paper presents our findings from participating in the multilingual acronym extraction shared task SDU@AAAI-22.
The task consists of acronym extraction from documents in 6 languages within scientific and legal domains.
Our system (team: SMR-NLP) achieved competitive performance for acronym extraction across all the languages.
- Score: 7.318106000226068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents our findings from participating in the multilingual
acronym extraction shared task SDU@AAAI-22. The task consists of acronym
extraction from documents in 6 languages within scientific and legal domains.
To address multilingual acronym extraction, we employed a BiLSTM-CRF with
multilingual XLM-RoBERTa embeddings. We pretrained the XLM-RoBERTa model on the
shared task corpus to further adapt XLM-RoBERTa embeddings to the shared task
domain(s). Our system (team: SMR-NLP) achieved competitive performance for
acronym extraction across all the languages.
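As a rough illustration of this recipe, the sketch below continues masked-language-model pretraining of XLM-RoBERTa on a plain-text dump of the shared-task corpus, after which the adapted encoder would be reused for the tagging stage. This is a minimal sketch, not the authors' code: the corpus file name, sequence length, and training hyperparameters are assumptions, and the BiLSTM-CRF tagger is only indicated in the comments.

```python
# Minimal sketch (not the authors' released code) of domain-adaptive pretraining:
# continue masked-language-model training of XLM-RoBERTa on the shared-task
# corpus, then reuse the adapted encoder as the embedding layer of a
# BiLSTM-CRF acronym tagger. File name and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Hypothetical plain-text dump of the SDU@AAAI-22 documents (all 6 languages).
corpus = load_dataset("text", data_files={"train": "sdu22_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, as in standard MLM pretraining.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="xlmr-acronym-dapt",
    num_train_epochs=3,                # assumed; not specified in this summary
    per_device_train_batch_size=16,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()

# The adapted checkpoint is then loaded (without the MLM head) to provide
# contextual embeddings for the downstream BiLSTM-CRF sequence labeler.
model.save_pretrained("xlmr-acronym-dapt")
tokenizer.save_pretrained("xlmr-acronym-dapt")
```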
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market
Domain [26.045871822474723]
This study introduces a language model called ESCOXLM-R, based on XLM-R, which uses domain-adaptive pre-training on the European Skills, Competences, Qualifications and Occupations taxonomy.
We evaluate the performance of ESCOXLM-R on 6 sequence labeling and 3 classification tasks in 4 languages and find that it achieves state-of-the-art results on 6 out of 9 datasets.
arXiv Detail & Related papers (2023-05-20T04:50:20Z) - LLM-RM at SemEval-2023 Task 2: Multilingual Complex NER using
XLM-RoBERTa [13.062351454646912]
This paper focuses on solving NER tasks in a multilingual setting for complex named entities.
We approach the problem by leveraging the cross-lingual representations obtained by fine-tuning the XLM-RoBERTa base model on the datasets of all 12 languages.
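A generic sketch of that kind of fine-tuning (not the LLM-RM system itself; the tag set and the example are hypothetical) looks roughly as follows, using a token-classification head on top of XLM-RoBERTa:

```python
# Generic sketch of fine-tuning XLM-RoBERTa for multilingual token
# classification (NER-style tagging); label set and example are
# illustrative assumptions, not the LLM-RM setup.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-ENT", "I-ENT"]           # hypothetical tag set
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

# One multilingual training example; in practice, examples from all
# languages are mixed into a single fine-tuning set.
words = ["Siemens", "entwickelt", "KI", "."]
tags = [1, 0, 0, 0]                        # indices into `labels`

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word-level tags to subword tokens: all subwords inherit the word's
# tag, and special tokens get -100 (ignored by the loss).
word_ids = enc.word_ids(0)
aligned = [-100 if w is None else tags[w] for w in word_ids]

out = model(**enc, labels=torch.tensor([aligned]))
out.loss.backward()                        # one illustrative training step
```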
arXiv Detail & Related papers (2023-05-05T06:05:45Z) - Modeling Sequential Sentence Relation to Improve Cross-lingual Dense
Retrieval [87.11836738011007]
We propose a multilingual language model called the masked sentence model (MSM).
MSM consists of a sentence encoder to generate the sentence representations, and a document encoder applied to a sequence of sentence vectors from a document.
To train the model, we propose a masked sentence prediction task, which masks and predicts the sentence vector via a hierarchical contrastive loss with sampled negatives.
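In a generic InfoNCE-style form (an assumed formulation; the paper's hierarchical contrastive loss may differ in detail), such a masked sentence prediction objective scores the true sentence vector against sampled negatives:

```latex
% Generic contrastive (InfoNCE-style) form of masked sentence prediction;
% an assumed formulation, not necessarily the paper's exact loss.
\mathcal{L}_{\mathrm{MSP}} =
  -\log \frac{\exp\left(\operatorname{sim}(\mathbf{h}_i, \mathbf{s}_i)/\tau\right)}
             {\exp\left(\operatorname{sim}(\mathbf{h}_i, \mathbf{s}_i)/\tau\right)
              + \sum_{\mathbf{s}_j \in \mathcal{N}} \exp\left(\operatorname{sim}(\mathbf{h}_i, \mathbf{s}_j)/\tau\right)}
```

where h_i is the document encoder's output at the masked position, s_i is the vector of the masked sentence, N is the set of sampled negative sentence vectors, and tau is a temperature.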
arXiv Detail & Related papers (2023-02-03T09:54:27Z) - XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked
Language Models [100.29953199404905]
We introduce a new approach for scaling to very large multilingual vocabularies by de-emphasizing token sharing between languages with little lexical overlap.
We train XLM-V, a multilingual language model with a one million token vocabulary.
XLM-V is particularly effective on low-resource language tasks and outperforms XLM-R by 11.2% and 5.8% absolute on MasakhaNER and Americas NLI, respectively.
arXiv Detail & Related papers (2023-01-25T09:15:17Z) - Multilingual ColBERT-X [11.768656900939048]
ColBERT-X is a dense retrieval model for Cross-Language Information Retrieval (CLIR).
In CLIR, documents are written in one natural language, while the queries are expressed in another.
A related task is multilingual IR (MLIR) where the system creates a single ranked list of documents written in many languages.
arXiv Detail & Related papers (2022-09-03T06:02:52Z) - Multi2WOZ: A Robust Multilingual Dataset and Conversational Pretraining
for Task-Oriented Dialog [67.20796950016735]
The Multi2WOZ dataset spans four typologically diverse languages: Chinese, German, Arabic, and Russian.
We introduce a new framework for multilingual conversational specialization of pretrained language models (PrLMs) that aims to facilitate cross-lingual transfer for arbitrary downstream TOD tasks.
Our experiments show that, in most setups, the best performance entails the combination of (i) conversational specialization in the target language and (ii) few-shot transfer for the concrete TOD task.
arXiv Detail & Related papers (2022-05-20T18:35:38Z) - An Ensemble Approach to Acronym Extraction using Transformers [7.88595796865485]
Acronyms are abbreviated forms of a phrase, constructed from the initial components of that phrase in a text.
This paper discusses an ensemble approach for the task of Acronym Extraction.
arXiv Detail & Related papers (2022-01-09T14:49:46Z) - Bootstrapping Multilingual AMR with Contextual Word Alignments [15.588190959488538]
We develop a novel technique for foreign-text-to-English AMR alignment, using the contextual word alignment between English and foreign language tokens.
This word alignment is weakly supervised and relies on the contextualized XLM-R word embeddings.
We achieve a highly competitive performance that surpasses the best published results for German, Italian, Spanish and Chinese.
arXiv Detail & Related papers (2021-02-03T18:35:55Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
We additionally propose a KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
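A minimal sketch of such a self-teaching loss (illustrative names, shapes, and temperature; not the FILTER implementation) is:

```python
# Sketch of a KL-divergence self-teaching loss: soft pseudo-labels from one
# branch supervise the model's predictions on the translated text in the
# target language. Illustrative only, not the FILTER implementation.
import torch
import torch.nn.functional as F

def self_teaching_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student): teacher probabilities are detached so only
    the student branch receives gradients."""
    soft_labels = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_labels, reduction="batchmean")

# Example: batch of 4 examples, 3 classes.
teacher = torch.randn(4, 3)                        # logits yielding pseudo-labels
student = torch.randn(4, 3, requires_grad=True)    # logits on translated text
loss = self_teaching_loss(student, teacher)
loss.backward()
```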
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot
Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
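A toy sketch of this kind of dictionary-based code-switching augmentation (the dictionaries and substitution rate are assumptions, not the CoSDA-ML implementation) is shown below:

```python
# Toy sketch of multi-lingual code-switching augmentation: randomly replace
# words in an English sentence with translations drawn from bilingual
# dictionaries, yielding mixed-language training data for mBERT fine-tuning.
# The dictionaries and the substitution rate are illustrative assumptions.
import random

# Hypothetical bilingual dictionaries: English word -> {language: translation}.
BILINGUAL_DICT = {
    "book":   {"de": "Buch",   "es": "libro",  "zh": "书"},
    "flight": {"de": "Flug",   "es": "vuelo",  "zh": "航班"},
    "cheap":  {"de": "billig", "es": "barato", "zh": "便宜"},
}

def code_switch(tokens, languages=("de", "es", "zh"), rate=0.5, seed=None):
    """Replace each dictionary word with probability `rate` by a translation
    in a randomly chosen target language."""
    rng = random.Random(seed)
    switched = []
    for tok in tokens:
        entry = BILINGUAL_DICT.get(tok.lower())
        if entry and rng.random() < rate:
            switched.append(entry[rng.choice(languages)])
        else:
            switched.append(tok)
    return switched

print(code_switch(["find", "a", "cheap", "flight"], seed=0))
# e.g. ['find', 'a', 'barato', '航班']; the augmented sentence is then used
# alongside the original for fine-tuning.
```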
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.