Design of Negative Sampling Strategies for Distantly Supervised Skill
Extraction
- URL: http://arxiv.org/abs/2209.05987v1
- Date: Tue, 13 Sep 2022 13:37:06 GMT
- Title: Design of Negative Sampling Strategies for Distantly Supervised Skill
Extraction
- Authors: Jens-Joris Decorte, Jeroen Van Hautte, Johannes Deleu, Chris Develder
and Thomas Demeester
- Abstract summary: We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
- Score: 19.43668931500507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Skills play a central role in the job market and many human resources (HR)
processes. In the wake of other digital experiences, today's online job market
has candidates expecting to see the right opportunities based on their skill
set. Similarly, enterprises increasingly need to use data to guarantee that the
skills within their workforce remain future-proof. However, structured
information about skills is often missing, and processes building on self- or
manager-assessment have been shown to struggle with issues around adoption,
completeness, and freshness of the resulting data. Extracting skills is a
highly challenging task, given the many thousands of possible skill labels
mentioned either explicitly or merely described implicitly and the lack of
finely annotated training corpora. Previous work on skill extraction overly
simplifies the task to an explicit entity detection task or builds on manually
annotated training data that would be infeasible if applied to a complete
vocabulary of skills. We propose an end-to-end system for skill extraction,
based on distant supervision through literal matching. We propose and evaluate
several negative sampling strategies, tuned on a small validation dataset, to
improve the generalization of skill extraction towards implicitly mentioned
skills, despite the lack of such implicit skills in the distantly supervised
data. We observe that using the ESCO taxonomy to select negative examples from
related skills yields the biggest improvements, and combining three different
strategies in one model further increases the performance, up to 8 percentage
points in RP@5. We introduce a manually annotated evaluation benchmark for
skill extraction based on the ESCO taxonomy, on which we validate our models.
We release the benchmark dataset for research purposes to stimulate further
research on the task.
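To make the described approach concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of the pipeline outlined in the abstract: positive training signal comes from literal matches of ESCO skill surface forms in job posting text, negative examples are sampled from skills the ESCO taxonomy lists as related to the matched positives, and RP@5 (R-Precision at 5) is the evaluation metric. All function names, the toy data, and the exact RP@5 definition used here are assumptions made for illustration.

```python
# Hypothetical sketch of distant supervision via literal matching with
# ESCO-related negative sampling. Illustrative only; not the paper's code.
import random
from typing import Dict, List, Set


def literal_match_positives(sentence: str,
                            skill_surface_forms: Dict[str, str]) -> Set[str]:
    """Return ESCO skill IDs whose surface form literally occurs in the sentence."""
    lowered = sentence.lower()
    return {skill_id for skill_id, surface in skill_surface_forms.items()
            if surface.lower() in lowered}


def sample_related_negatives(positives: Set[str],
                             esco_related: Dict[str, List[str]],
                             k: int = 5) -> Set[str]:
    """Sample negative skill labels from ESCO skills related to the matched
    positives (the strategy the abstract reports as most effective),
    excluding the positives themselves."""
    candidates = {rel
                  for skill_id in positives
                  for rel in esco_related.get(skill_id, [])
                  if rel not in positives}
    return set(random.sample(sorted(candidates), min(k, len(candidates))))


def r_precision_at_5(ranked_skills: List[str], gold_skills: Set[str]) -> float:
    """One common reading of RP@5: precision over the top min(5, |gold|)
    ranked predictions (assumed definition, for illustration)."""
    cutoff = min(5, len(gold_skills))
    if cutoff == 0:
        return 0.0
    return sum(1 for s in ranked_skills[:cutoff] if s in gold_skills) / cutoff


if __name__ == "__main__":
    # Toy ESCO-like data (illustrative only).
    surface_forms = {"esco:python": "Python", "esco:sql": "SQL"}
    related = {"esco:python": ["esco:java", "esco:scripting"],
               "esco:sql": ["esco:nosql"]}

    sentence = "We need a data engineer with Python and SQL experience."
    pos = literal_match_positives(sentence, surface_forms)
    neg = sample_related_negatives(pos, related, k=2)
    print("positives:", pos, "negatives:", neg)
    print("RP@5:", r_precision_at_5(["esco:python", "esco:sql"], pos))
```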
Related papers
- KBAlign: Efficient Self Adaptation on Specific Knowledge Bases [75.78948575957081]
Large language models (LLMs) usually rely on retrieval-augmented generation to exploit knowledge materials in an instant manner.
We propose KBAlign, an approach designed for efficient adaptation to downstream tasks involving knowledge bases.
Our method utilizes iterative training with self-annotated data such as Q&A pairs and revision suggestions, enabling the model to grasp the knowledge content efficiently.
arXiv Detail & Related papers (2024-11-22T08:21:03Z)
- Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration [54.8229698058649]
We study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies.
Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits.
We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z)
- Computational Job Market Analysis with Natural Language Processing [5.117211717291377]
This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions.
We frame the problem, obtain annotated data, and introduce extraction methodologies.
Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training.
arXiv Detail & Related papers (2024-04-29T14:52:38Z)
- Rethinking Skill Extraction in the Job Market Domain using Large Language Models [20.256353240384133]
Skill Extraction involves identifying skills and qualifications mentioned in documents such as job postings and resumes.
The reliance on manually annotated data limits the generalizability of such approaches.
In this paper, we explore the use of in-context learning to overcome these challenges.
arXiv Detail & Related papers (2024-02-06T09:23:26Z)
- NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets.
We employ an external datastore for retrieving similar skills in a dataset-unifying manner.
We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z)
- Extreme Multi-Label Skill Extraction Training using Large Language Models [19.095612333241288]
We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction.
Our results show a consistent increase of 15 to 25 percentage points in R-Precision@5 compared to previously published results.
arXiv Detail & Related papers (2023-07-20T11:29:15Z)
- Large Language Models as Batteries-Included Zero-Shot ESCO Skills Matchers [0.0]
We propose an end-to-end zero-shot system for skills extraction from job descriptions based on large language models (LLMs).
We generate synthetic training data for the entirety of ESCO skills and train a classifier to extract skill mentions from job posts.
We also employ a similarity retriever to generate skill candidates which are then re-ranked using a second LLM.
arXiv Detail & Related papers (2023-07-07T12:04:12Z)
- "FIJO": a French Insurance Soft Skill Detection Dataset [0.0]
This article proposes a new public dataset, FIJO, containing insurance job offers, including many soft skill annotations.
We present the results of skill detection algorithms using a named entity recognition approach and show that transformer-based models have good token-wise performance on this dataset.
arXiv Detail & Related papers (2022-04-11T15:54:22Z)
- Hierarchical Skills for Efficient Exploration [70.62309286348057]
In reinforcement learning, pre-trained low-level skills have the potential to greatly facilitate exploration.
Prior knowledge of the downstream task is required to strike the right balance between generality (fine-grained control) and specificity (faster learning) in skill design.
We propose a hierarchical skill learning framework that acquires skills of varying complexity in an unsupervised manner.
arXiv Detail & Related papers (2021-10-20T22:29:32Z)
- Hierarchical Few-Shot Imitation with Skill Transition Models [66.81252581083199]
Few-shot Imitation with Skill Transition Models (FIST) is an algorithm that extracts skills from offline data and utilizes them to generalize to unseen tasks.
We show that FIST is capable of generalizing to new tasks and substantially outperforms prior baselines in navigation experiments.
arXiv Detail & Related papers (2021-07-19T15:56:01Z)
- Reinforcement Learning with Prototypical Representations [114.35801511501639]
Proto-RL is a self-supervised framework that ties representation learning with exploration through prototypical representations.
These prototypes simultaneously serve as a summarization of the exploratory experience of an agent as well as a basis for representing observations.
This enables state-of-the-art downstream policy learning on a set of difficult continuous control tasks.
arXiv Detail & Related papers (2021-02-22T18:56:34Z)