Design of Negative Sampling Strategies for Distantly Supervised Skill
Extraction
- URL: http://arxiv.org/abs/2209.05987v1
- Date: Tue, 13 Sep 2022 13:37:06 GMT
- Title: Design of Negative Sampling Strategies for Distantly Supervised Skill
Extraction
- Authors: Jens-Joris Decorte, Jeroen Van Hautte, Johannes Deleu, Chris Develder
and Thomas Demeester
- Abstract summary: We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
- Score: 19.43668931500507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Skills play a central role in the job market and many human resources (HR)
processes. In the wake of other digital experiences, today's online job market
has candidates expecting to see the right opportunities based on their skill
set. Similarly, enterprises increasingly need to use data to guarantee that the
skills within their workforce remain future-proof. However, structured
information about skills is often missing, and processes building on self- or
manager-assessment have been shown to struggle with issues around adoption,
completeness, and freshness of the resulting data. Extracting skills is a
highly challenging task, given the many thousands of possible skill labels
mentioned either explicitly or merely described implicitly and the lack of
finely annotated training corpora. Previous work on skill extraction overly
simplifies the task to an explicit entity detection task or builds on manually
annotated training data that would be infeasible if applied to a complete
vocabulary of skills. We propose an end-to-end system for skill extraction,
based on distant supervision through literal matching. We propose and evaluate
several negative sampling strategies, tuned on a small validation dataset, to
improve the generalization of skill extraction towards implicitly mentioned
skills, despite the lack of such implicit skills in the distantly supervised
data. We observe that using the ESCO taxonomy to select negative examples from
related skills yields the biggest improvements, and combining three different
strategies in one model further increases the performance, up to 8 percentage
points in RP@5. We introduce a manually annotated evaluation benchmark for
skill extraction based on the ESCO taxonomy, on which we validate our models.
We release the benchmark dataset for research purposes to stimulate further
research on the task.
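To make the described approach concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the pipeline outlined in the abstract: positive training signal comes from literal matches of ESCO skill surface forms in job posting text, negative examples are sampled from skills the ESCO taxonomy lists as related to the matched positives, and RP@5 (R-Precision at 5) is the evaluation metric. All function names, the toy data, and the exact RP@5 definition used here are assumptions made for illustration.

```python
# Hypothetical sketch of distant supervision via literal matching with
# ESCO-related negative sampling. Illustrative only; not the paper's code.
import random
from typing import Dict, List, Set


def literal_match_positives(sentence: str,
                            skill_surface_forms: Dict[str, str]) -> Set[str]:
    """Return ESCO skill IDs whose surface form literally occurs in the sentence."""
    lowered = sentence.lower()
    return {skill_id for skill_id, surface in skill_surface_forms.items()
            if surface.lower() in lowered}


def sample_related_negatives(positives: Set[str],
                             esco_related: Dict[str, List[str]],
                             k: int = 5) -> Set[str]:
    """Sample negative skill labels from ESCO skills related to the matched
    positives (the strategy the abstract reports as most effective),
    excluding the positives themselves."""
    candidates = {rel
                  for skill_id in positives
                  for rel in esco_related.get(skill_id, [])
                  if rel not in positives}
    return set(random.sample(sorted(candidates), min(k, len(candidates))))


def r_precision_at_5(ranked_skills: List[str], gold_skills: Set[str]) -> float:
    """One common reading of RP@5: precision over the top min(5, |gold|)
    ranked predictions (assumed definition, for illustration)."""
    cutoff = min(5, len(gold_skills))
    if cutoff == 0:
        return 0.0
    return sum(1 for s in ranked_skills[:cutoff] if s in gold_skills) / cutoff


if __name__ == "__main__":
    # Toy ESCO-like data (illustrative only).
    surface_forms = {"esco:python": "Python", "esco:sql": "SQL"}
    related = {"esco:python": ["esco:java", "esco:scripting"],
               "esco:sql": ["esco:nosql"]}

    sentence = "We need a data engineer with Python and SQL experience."
    pos = literal_match_positives(sentence, surface_forms)
    neg = sample_related_negatives(pos, related, k=2)
    print("positives:", pos, "negatives:", neg)
    print("RP@5:", r_precision_at_5(["esco:python", "esco:sql"], pos))
```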
Related papers
- KBAlign: Efficient Self Adaptation on Specific Knowledge Bases [75.78948575957081]
Large language models (LLMs) usually rely on retrieval-augmented generation to exploit knowledge materials in an instant manner.
We propose KBAlign, an approach designed for efficient adaptation to downstream tasks involving knowledge bases.
Our method utilizes iterative training with self-annotated data such as Q&A pairs and revision suggestions, enabling the model to grasp the knowledge content efficiently.
arXiv Detail & Related papers (2024-11-22T08:21:03Z)
- Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration [54.8229698058649]
We study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies.
Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits.
We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z)
- Computational Job Market Analysis with Natural Language Processing [5.117211717291377]
This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions.
We frame the problem, obtain annotated data, and introduce extraction methodologies.
Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training.
arXiv Detail & Related papers (2024-04-29T14:52:38Z)
- Rethinking Skill Extraction in the Job Market Domain using Large Language Models [20.256353240384133]
Skill Extraction involves identifying skills and qualifications mentioned in documents such as job postings and resumes.
The reliance on manually annotated data limits the generalizability of such approaches.
In this paper, we explore the use of in-context learning to overcome these challenges.
arXiv Detail & Related papers (2024-02-06T09:23:26Z)
- NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets.
We employ an external datastore for retrieving similar skills in a dataset-unifying manner.
We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z)
- Extreme Multi-Label Skill Extraction Training using Large Language Models [19.095612333241288]
We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction.
Our results show a consistent increase of 15 to 25 percentage points in R-Precision@5 compared to previously published results.
arXiv Detail & Related papers (2023-07-20T11:29:15Z)
- Large Language Models as Batteries-Included Zero-Shot ESCO Skills Matchers [0.0]
We propose an end-to-end zero-shot system for skills extraction from job descriptions based on large language models (LLMs).
We generate synthetic training data for the entirety of ESCO skills and train a classifier to extract skill mentions from job posts.
We also employ a similarity retriever to generate skill candidates which are then re-ranked using a second LLM.
arXiv Detail & Related papers (2023-07-07T12:04:12Z)
- "FIJO": a French Insurance Soft Skill Detection Dataset [0.0]
This article proposes a new public dataset, FIJO, containing insurance job offers, including many soft skill annotations.
We present the results of skill detection algorithms using a named entity recognition approach and show that transformer-based models have good token-wise performance on this dataset.
arXiv Detail & Related papers (2022-04-11T15:54:22Z)
- Hierarchical Skills for Efficient Exploration [70.62309286348057]
In reinforcement learning, pre-trained low-level skills have the potential to greatly facilitate exploration.
Prior knowledge of the downstream task is required to strike the right balance between generality (fine-grained control) and specificity (faster learning) in skill design.
We propose a hierarchical skill learning framework that acquires skills of varying complexity in an unsupervised manner.
arXiv Detail & Related papers (2021-10-20T22:29:32Z)
- Hierarchical Few-Shot Imitation with Skill Transition Models [66.81252581083199]
Few-shot Imitation with Skill Transition Models (FIST) is an algorithm that extracts skills from offline data and utilizes them to generalize to unseen tasks.
We show that FIST is capable of generalizing to new tasks and substantially outperforms prior baselines in navigation experiments.
arXiv Detail & Related papers (2021-07-19T15:56:01Z)
- Reinforcement Learning with Prototypical Representations [114.35801511501639]
Proto-RL is a self-supervised framework that ties representation learning with exploration through prototypical representations.
These prototypes simultaneously serve as a summarization of the exploratory experience of an agent as well as a basis for representing observations.
This enables state-of-the-art downstream policy learning on a set of difficult continuous control tasks.
arXiv Detail & Related papers (2021-02-22T18:56:34Z)