Related papers: Multilingual hierarchical classification of job advertisements for job vacancy statistics

URL: http://arxiv.org/abs/2411.03779v2
Date: Mon, 18 Aug 2025 10:47:56 GMT
Title: Multilingual hierarchical classification of job advertisements for job vacancy statistics
Authors: Maciej Beręsewicz, Marek Wydmuch, Herman Cherniaiev, Robert Pater
Abstract summary: The goal of this paper is to develop a multilingual classifier for online job advertisements. We show that incorporation of the hierarchical structure of occupations improves prediction accuracy by 1-2 percentage points. A bilingual (Polish and English) and multilingual (24 languages) model is developed based on data translated using closed and open-source software.
Score: 1.6874375111244329
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The goal of this paper is to develop a multilingual classifier and conditional probability estimator of occupation codes for online job advertisements in accordance with the International Standard Classification of Occupations (ISCO) extended with the Polish Classification of Occupations and Specializations (KZiS), which is analogous to the European Classification of Occupations. In this paper, we utilise a range of data sources, including a novel one, namely the Central Job Offers Database, which is a register of all vacancies submitted to Public Employment Offices. Their staff members code the vacancies according to the ISCO and KZiS. A hierarchical multi-class classifier has been developed based on the transformer architecture. The classifier begins by encoding the jobs found in advertisements to the widest 1-digit occupational group, and then narrows the assignment to a 6-digit occupation code. We show that incorporation of the hierarchical structure of occupations improves prediction accuracy by 1-2 percentage points, particularly for the hand-coded online job advertisements. Finally, a bilingual (Polish and English) and multilingual (24 languages) model is developed based on data translated using closed and open-source software. The open-source software is provided for the benefit of the official statistics community, with a particular focus on international comparability.
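The top-down narrowing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the transformer model is replaced by a hand-written table of conditional probabilities, the codes are toy placeholders with three levels rather than real 6-digit ISCO/KZiS codes, and the names `COND_PROBS`, `leaf_probability`, and `greedy_top_down` are hypothetical. It only shows how a leaf code's probability can be obtained as a product of per-level conditional probabilities, and how a prediction can start at the widest group and narrow level by level.

```python
from math import prod

# Hypothetical conditional probabilities P(child | parent) for one advertisement.
# Codes are toy placeholders, not real ISCO/KZiS codes; "" denotes the root.
COND_PROBS = {
    "": {"1": 0.25, "2": 0.75},
    "1": {"11": 0.5, "12": 0.5},
    "2": {"21": 0.25, "22": 0.75},
    "11": {"111": 1.0},
    "12": {"121": 1.0},
    "21": {"211": 1.0},
    "22": {"221": 0.25, "222": 0.75},
}


def leaf_probability(code, cond_probs):
    """Probability of a leaf code: the product of the conditional
    probabilities along its path from the root (chain rule)."""
    prefixes = [code[:i] for i in range(len(code) + 1)]
    return prod(
        cond_probs[parent][child]
        for parent, child in zip(prefixes, prefixes[1:])
    )


def greedy_top_down(cond_probs):
    """Start at the widest group and pick the most probable child
    at each level until a leaf code is reached."""
    node = ""
    while node in cond_probs:
        children = cond_probs[node]
        node = max(children, key=children.get)
    return node


if __name__ == "__main__":
    best = greedy_top_down(COND_PROBS)
    print(best, leaf_probability(best, COND_PROBS))
```

Greedy narrowing is only one possible decoding choice in this sketch; because every leaf has a well-defined probability, one could instead score all leaves and take the maximum, which may differ from the greedy path when an early level is uncertain.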

Related papers

Enhancing Job Matching: Occupation, Skill and Qualification Linking with the ESCO and EQF taxonomies [0.0]
This study investigates the potential of language models to improve the classification of labor market information. We examine and compare two prominent methodologies from the literature: Sentence Linking and Entity Linking. In support of ongoing research, we release an open-source tool, incorporating these two methodologies.
arXiv Detail & Related papers (2025-12-02T19:49:43Z)
Standard Occupation Classifier -- A Natural Language Processing Approach [0.0]
This project investigates the use of recent developments in natural language processing to construct a classifier capable of assigning an occupation code to a given job advertisement. We develop various classifiers for both UK ONS SOC and US O*NET SOC, using different Language Models. We find that an ensemble model, which combines Google BERT and a Neural Network classifier while considering job title, description, and skills, achieved the highest prediction accuracy.
arXiv Detail & Related papers (2025-11-28T10:30:37Z)
Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management [0.2276267460638319]
We present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market.
arXiv Detail & Related papers (2025-07-17T16:33:57Z)
JobHop: A Large-Scale Dataset of Career Trajectories [48.881023210777585]
JobHop is a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. We process unstructured resume data to extract structured career information, which is then mapped to standardized ESCO occupation codes. This results in a rich dataset of over 2.3 million work experiences, extracted from and grouped into more than 391,000 user resumes.
arXiv Detail & Related papers (2025-05-12T15:22:29Z)
Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts. Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models. The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
Hierarchical Classification of Transversal Skills in Job Ads Based on Sentence Embeddings [0.0]
This paper aims to identify correlations between job ad requirements and skill sets using a deep learning model. The approach involves data collection, preprocessing, and labeling using the ESCO (European Skills, Competences, and Occupations) taxonomy.
arXiv Detail & Related papers (2024-01-10T11:07:32Z)
Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
Transfer-Free Data-Efficient Multilingual Slot Labeling [82.02076369811402]
Slot labeling is a core component of task-oriented dialogue (ToD) systems. To mitigate the inherent data scarcity issue, current research on multilingual ToD assumes that sufficient English-language annotated data are always available. We propose a two-stage slot labeling approach (termed TWOSL) which transforms standard multilingual sentence encoders into effective slot labelers.
arXiv Detail & Related papers (2023-05-22T22:47:32Z)
ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain [26.045871822474723]
This study introduces a language model called ESCOXLM-R, based on XLM-R, which uses domain-adaptive pre-training on the European Skills, Competences, Qualifications and Occupations taxonomy. We evaluate the performance of ESCOXLM-R on 6 sequence labeling and 3 classification tasks in 4 languages and find that it achieves state-of-the-art results on 6 out of 9 datasets.
arXiv Detail & Related papers (2023-05-20T04:50:20Z)
Predicting Job Titles from Job Descriptions with Multi-label Text Classification [0.0]
We propose a multi-label classification approach for predicting relevant job titles from job description texts. We implement a Bi-GRU-LSTM-CNN with different pre-trained language models and apply it to the job title prediction problem.
arXiv Detail & Related papers (2021-12-21T09:31:03Z)
On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments. We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z)
Prix-LM: Pretraining for Multilingual Knowledge Base Construction [59.02868906044296]
We propose a unified framework, Prix-LM, for multilingual knowledge construction and completion. We leverage two types of knowledge, monolingual triples and cross-lingual links, extracted from existing multilingual KBs. Experiments on standard entity-related tasks, such as link prediction in multiple languages, cross-lingual entity linking and bilingual lexicon induction, demonstrate its effectiveness.
arXiv Detail & Related papers (2021-10-16T02:08:46Z)
Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL). We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task. We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.