Extreme Multi-Label Skill Extraction Training using Large Language
Models
- URL: http://arxiv.org/abs/2307.10778v1
- Date: Thu, 20 Jul 2023 11:29:15 GMT
- Title: Extreme Multi-Label Skill Extraction Training using Large Language
Models
- Authors: Jens-Joris Decorte, Severine Verlinden, Jeroen Van Hautte, Johannes
Deleu, Chris Develder and Thomas Demeester
- Abstract summary: We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction.
Our results show a consistent increase of between 15 to 25 percentage points in textitR-Precision@5 compared to previously published results.
- Score: 19.095612333241288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online job ads serve as a valuable source of information for skill
requirements, playing a crucial role in labor market analysis and e-recruitment
processes. Since such ads are typically formatted in free text, natural
language processing (NLP) technologies are required to automatically process
them. We specifically focus on the task of detecting skills (mentioned
literally, or implicitly described) and linking them to a large skill ontology,
making it a challenging case of extreme multi-label classification (XMLC).
Given that there is no sizable labeled (training) dataset are available for
this specific XMLC task, we propose techniques to leverage general Large
Language Models (LLMs). We describe a cost-effective approach to generate an
accurate, fully synthetic labeled dataset for skill extraction, and present a
contrastive learning strategy that proves effective in the task. Our results
across three skill extraction benchmarks show a consistent increase of between
15 to 25 percentage points in \textit{R-Precision@5} compared to previously
published results that relied solely on distant supervision through literal
matches.
Related papers
- Computational Job Market Analysis with Natural Language Processing [5.117211717291377]
This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions.
We frame the problem, obtaining annotated data, and introducing extraction methodologies.
Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training.
arXiv Detail & Related papers (2024-04-29T14:52:38Z) - Rethinking Skill Extraction in the Job Market Domain using Large
Language Models [20.256353240384133]
Skill Extraction involves identifying skills and qualifications mentioned in documents such as job postings and resumes.
The reliance on manually annotated data limits the generalizability of such approaches.
In this paper, we explore the use of in-context learning to overcome these challenges.
arXiv Detail & Related papers (2024-02-06T09:23:26Z) - NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets.
We employ an external datastore for retrieving similar skills in a dataset-unifying manner.
We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z) - Automated Few-shot Classification with Instruction-Finetuned Language
Models [76.69064714392165]
We show that AuT-Few outperforms state-of-the-art few-shot learning methods.
We also show that AuT-Few is the best ranking method across datasets on the RAFT few-shot benchmark.
arXiv Detail & Related papers (2023-05-21T21:50:27Z) - Design of Negative Sampling Strategies for Distantly Supervised Skill
Extraction [19.43668931500507]
We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
arXiv Detail & Related papers (2022-09-13T13:37:06Z) - "FIJO": a French Insurance Soft Skill Detection Dataset [0.0]
This article proposes a new public dataset, FIJO, containing insurance job offers, including many soft skill annotations.
We present the results of skill detection algorithms using a named entity recognition approach and show that transformers-based models have good token-wise performances on this dataset.
arXiv Detail & Related papers (2022-04-11T15:54:22Z) - Zero-Shot Information Extraction as a Unified Text-to-Triple Translation [56.01830747416606]
We cast a suite of information extraction tasks into a text-to-triple translation framework.
We formalize the task as a translation between task-specific input text and output triples.
We study the zero-shot performance of this framework on open information extraction.
arXiv Detail & Related papers (2021-09-23T06:54:19Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named
Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z) - Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z) - Self-training Improves Pre-training for Natural Language Understanding [63.78927366363178]
We study self-training as another way to leverage unlabeled data through semi-supervised learning.
We introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data.
Our approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks.
arXiv Detail & Related papers (2020-10-05T17:52:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.