ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain
- URL: http://arxiv.org/abs/2305.12092v1
- Date: Sat, 20 May 2023 04:50:20 GMT
- Title: ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain
- Authors: Mike Zhang and Rob van der Goot and Barbara Plank
- Abstract summary: This study introduces a language model called ESCOXLM-R, based on XLM-R, which uses domain-adaptive pre-training on the European Skills, Competences, Qualifications and Occupations taxonomy.
We evaluate the performance of ESCOXLM-R on 6 sequence labeling and 3 classification tasks in 4 languages and find that it achieves state-of-the-art results on 6 out of 9 datasets.
- Score: 26.045871822474723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing number of benchmarks for Natural Language Processing (NLP)
tasks in the computational job market domain highlights the demand for methods
that can handle job-related tasks such as skill extraction, skill
classification, job title classification, and de-identification. While some
approaches have been developed that are specific to the job market domain,
there is a lack of generalized, multilingual models and benchmarks for these
tasks. In this study, we introduce a language model called ESCOXLM-R, based on
XLM-R, which uses domain-adaptive pre-training on the European Skills,
Competences, Qualifications and Occupations (ESCO) taxonomy, covering 27
languages. The pre-training objectives for ESCOXLM-R include dynamic masked
language modeling and a novel additional objective for inducing multilingual
taxonomical ESCO relations. We comprehensively evaluate the performance of
ESCOXLM-R on 6 sequence labeling and 3 classification tasks in 4 languages and
find that it achieves state-of-the-art results on 6 out of 9 datasets. Our
analysis reveals that ESCOXLM-R performs better on short spans and outperforms
XLM-R on entity-level and surface-level span-F1, likely due to ESCO containing
short skill and occupation titles, and encoding information on the
entity-level.
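The "dynamic masked language modeling" objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes the usual meaning of dynamic masking: a new mask is sampled each time a sequence is presented, rather than fixed once during preprocessing. The function name `dynamic_mask`, the `<mask>` string, the masking probabilities, and the example sentence are all hypothetical choices for illustration; the abstract does not state the authors' masking rate or implementation, and practical MLM setups typically also replace some selected tokens with random or unchanged tokens, which is omitted here.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (assumption for this sketch)


def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels), sampling a fresh mask on every call.

    labels[i] holds the original token where a mask was applied and None
    elsewhere (positions that a training loss would ignore).
    """
    rng = rng or random
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model would have to recover this token
        else:
            masked.append(tok)
            labels.append(None)  # not part of the prediction target
    return masked, labels


if __name__ == "__main__":
    # Hypothetical job-posting fragment; each pass re-samples the mask.
    tokens = "experience with python and project management".split()
    rng = random.Random(0)
    for epoch in range(3):
        masked, _ = dynamic_mask(tokens, mask_prob=0.3, rng=rng)
        print(epoch, " ".join(masked))
```

Because the mask is drawn inside the function rather than stored with the data, the same sentence can expose different prediction targets across epochs, which is the property that distinguishes dynamic from static masking.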
Related papers
- Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [90.73517523001149]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs.
By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct.
We propose different training strategies to build powerful xMR LLMs, named MathOctopus, which notably outperform conventional open-source LLMs.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
- SCALE: Scaling up the Complexity for Advanced Language Model Evaluation [19.339580164451256]
We introduce a novel NLP benchmark that poses challenges to current Large Language Models (LLMs).
Our benchmark comprises diverse legal NLP datasets from the Swiss legal system.
As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference.
arXiv Detail & Related papers (2023-06-15T16:19:15Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation [80.16548523140025]
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task to bridge the gap between the pretrain and finetune stages.
Our approach could narrow the cross-lingual sentence representation distance and improve low-frequency word translation with trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z)
- XLM-K: Improving Cross-Lingual Language Model Pre-Training with Multilingual Knowledge [31.765178013933134]
Cross-lingual pre-training has achieved great successes using monolingual and bilingual plain text corpora.
We propose XLM-K, a cross-lingual language model incorporating multilingual knowledge in pre-training.
arXiv Detail & Related papers (2021-09-26T11:46:20Z)
- XeroAlign: Zero-Shot Cross-lingual Transformer Alignment [9.340611077939828]
We introduce a method for task-specific alignment of cross-lingual pretrained transformers such as XLM-R.
XeroAlign uses translated task data to encourage the model to generate similar sentence embeddings for different languages.
The aligned model, XLM-RA, achieves text classification accuracy that exceeds that of XLM-R trained with labelled data, and it performs on par with state-of-the-art models on a cross-lingual adversarial paraphrasing task.
arXiv Detail & Related papers (2021-05-06T07:10:00Z)
- FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
- On Learning Universal Representations Across Languages [37.555675157198145]
We extend existing approaches to learn sentence-level representations and show the effectiveness on cross-lingual understanding and generation.
Specifically, we propose a Hierarchical Contrastive Learning (HiCTL) method to learn universal representations for parallel sentences distributed in one or multiple languages.
We conduct evaluations on two challenging cross-lingual tasks, XTREME and machine translation.
arXiv Detail & Related papers (2020-07-31T10:58:39Z)
- XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation [100.09099800591822]
XGLUE is a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models.
XGLUE provides 11 diversified tasks that cover both natural language understanding and generation scenarios.
arXiv Detail & Related papers (2020-04-03T07:03:12Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME (Cross-lingual TRansfer Evaluation of Multilingual Encoders) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.