NNOSE: Nearest Neighbor Occupational Skill Extraction
- URL: http://arxiv.org/abs/2401.17092v1
- Date: Tue, 30 Jan 2024 15:18:29 GMT
- Title: NNOSE: Nearest Neighbor Occupational Skill Extraction
- Authors: Mike Zhang and Rob van der Goot and Min-Yen Kan and Barbara Plank
- Abstract summary: We tackle the complexity in occupational skill datasets.
We employ an external datastore for retrieving similar skills in a dataset-unifying manner.
We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
- Score: 55.22292957778972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The labor market is changing rapidly, prompting increased interest in the
automatic extraction of occupational skills from text. With the advent of
English benchmark job description datasets, there is a need for systems that
handle their diversity well. We tackle the complexity of occupational skill
dataset tasks: combining and leveraging multiple datasets for skill
extraction, identifying rarely observed skills within a dataset, and
overcoming the scarcity of skills across datasets. In particular, we
investigate the
retrieval-augmentation of language models, employing an external datastore for
retrieving similar skills in a dataset-unifying manner. Our proposed method,
Nearest Neighbor Occupational Skill Extraction (NNOSE), effectively leverages
multiple datasets by retrieving neighboring skills from other datasets in the
datastore. This improves skill extraction without additional fine-tuning.
Crucially, we observe a performance gain in predicting infrequent patterns,
with substantial gains of up to 30% span-F1 in cross-dataset settings.
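As a concrete illustration of the retrieval augmentation described in the abstract, here is a minimal sketch of kNN-style skill extraction over a datastore pooled from several datasets. The function names (`encode_tokens`, `tagger_probs`), the distance-weighted neighbor vote, and the interpolation weight `lambda_` are illustrative assumptions, not the paper's exact formulation:

```python
# Minimal sketch of kNN-augmented skill extraction in the spirit of NNOSE.
# `encode_tokens`, `tagger_probs`, and `lambda_` are illustrative placeholders.
import numpy as np

def build_datastore(datasets, encode_tokens):
    """Pool (token representation, BIO tag) pairs from several datasets."""
    keys, values = [], []
    for tokens, tags in datasets:                   # one unified datastore
        for vec, tag in zip(encode_tokens(tokens), tags):
            keys.append(vec)
            values.append(tag)
    return np.stack(keys), np.array(values)

def knn_tag_distribution(query, keys, values, tagset, k=8, temp=1.0):
    """Distance-weighted tag distribution from the k nearest neighbors."""
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = np.exp(-dists[nearest] / temp)
    probs = np.zeros(len(tagset))
    for idx, w in zip(nearest, weights):
        probs[tagset.index(values[idx])] += w
    return probs / probs.sum()

def nnose_predict(tokens, encode_tokens, tagger_probs, keys, values,
                  tagset, lambda_=0.3):
    """Interpolate the tagger's distribution with the kNN distribution."""
    preds = []
    for vec, p_model in zip(encode_tokens(tokens), tagger_probs(tokens)):
        p_knn = knn_tag_distribution(vec, keys, values, tagset)
        p = (1 - lambda_) * p_model + lambda_ * p_knn  # no extra fine-tuning
        preds.append(tagset[int(np.argmax(p))])
    return preds
```

Because the datastore is external to the tagger, skills from new datasets can be added by appending key-value pairs, which is what allows gains without further fine-tuning.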
Related papers
- Adapt-$\infty$: Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for Lifelong Instruction Tuning.
We construct pseudo-skill clusters by grouping gradient-based sample vectors.
We select the best-performing data selector for each skill cluster from a pool of selector experts.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
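A minimal sketch of the cluster-then-select idea in the Adapt-$\infty$ summary above, assuming k-means over gradient-based sample vectors and a scoring function for selector experts; all names here are illustrative, not the paper's code:

```python
# Illustrative sketch: pseudo-skill clusters + per-cluster selector choice.
# `grad_vec`, `selectors`, and `score` are hypothetical stand-ins.
import numpy as np
from sklearn.cluster import KMeans

def pseudo_skill_clusters(samples, grad_vec, n_clusters=4):
    """Group samples by their gradient-based feature vectors."""
    feats = np.stack([grad_vec(s) for s in samples])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
    clusters = {}
    for sample, label in zip(samples, labels):
        clusters.setdefault(int(label), []).append(sample)
    return clusters

def pick_selector_per_cluster(clusters, selectors, score):
    """Choose the best-scoring data selector for each pseudo-skill cluster."""
    return {c: max(selectors, key=lambda sel: score(sel, data))
            for c, data in clusters.items()}
```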
- Automated Question Generation on Tabular Data for Conversational Data Exploration [1.2574534342156884]
We propose a system that recommends interesting questions in natural language based on relevant slices of a dataset in a conversational setting.
We use our own fine-tuned variation of a pre-trained language model (T5) to generate natural language questions in a specific manner.
arXiv Detail & Related papers (2024-07-10T08:07:05Z)
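The question-generation summary above suggests a simple pipeline: serialize a table slice to text and decode questions with a seq2seq model. A sketch under that assumption follows; the paper fine-tunes its own T5 variant, while `t5-small` and the serialization format here are stand-ins:

```python
# Sketch of generating questions from a table slice with a seq2seq LM.
# "t5-small" and the serialization format are assumptions for illustration.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

def serialize_slice(columns, rows):
    """Flatten a table slice into text the model can condition on."""
    header = " | ".join(columns)
    body = " ; ".join(" | ".join(str(v) for v in row) for row in rows)
    return f"generate question: {header} ; {body}"

tok = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompt = serialize_slice(["country", "gdp_2023"],
                         [["DK", "404B"], ["NL", "1.1T"]])
inputs = tok(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48,
                         do_sample=True, num_return_sequences=3)
print(tok.batch_decode(outputs, skip_special_tokens=True))
```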
- Computational Job Market Analysis with Natural Language Processing [5.117211717291377]
This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions.
We frame the problem, obtain annotated data, and introduce extraction methodologies.
Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training.
arXiv Detail & Related papers (2024-04-29T14:52:38Z)
- JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching [18.94748873243611]
JobSkape is a framework to generate synthetic data for skill-to-taxonomy matching.
Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings.
We present a multi-step pipeline for skill extraction and matching tasks using large language models.
arXiv Detail & Related papers (2024-02-05T17:57:26Z)
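A hedged sketch of a multi-step extract-then-match pipeline like the one the JobSkape summary above describes; the `llm` completion function, the `embed` model, and the cosine-similarity matching rule are assumptions for illustration:

```python
# Sketch of a two-step pipeline: LLM span extraction, then taxonomy matching.
# `llm` and `embed` are hypothetical callables, not the framework's API.
import numpy as np

def extract_skills(posting, llm):
    """Step 1: ask an LLM for candidate skill spans in a job posting."""
    answer = llm(f"List the skills mentioned, one per line:\n{posting}")
    return [line.strip("- ").strip()
            for line in answer.splitlines() if line.strip()]

def match_to_taxonomy(spans, taxonomy, embed):
    """Step 2: map each span to its nearest taxonomy entry by cosine similarity."""
    tax_vecs = np.stack([embed(t) for t in taxonomy])
    tax_vecs /= np.linalg.norm(tax_vecs, axis=1, keepdims=True)
    matches = {}
    for span in spans:
        v = embed(span)
        v = v / np.linalg.norm(v)
        matches[span] = taxonomy[int(np.argmax(tax_vecs @ v))]
    return matches
```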
- Extreme Multi-Label Skill Extraction Training using Large Language Models [19.095612333241288]
We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction.
Our results show a consistent increase of 15 to 25 percentage points in R-Precision@5 compared to previously published results.
arXiv Detail & Related papers (2023-07-20T11:29:15Z)
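One plausible shape for the fully synthetic labeled-data generation mentioned above: sample target skills, then have an LLM write a sentence that mentions them. The prompt, the skill list, and the `llm` callable are invented for illustration and are not the paper's recipe:

```python
# Sketch of fully synthetic labeled-data generation for skill extraction.
# SKILLS, the prompt, and `llm` are illustrative assumptions.
import random

SKILLS = ["Python", "project management", "SQL", "public speaking"]

def synth_example(llm, k=2):
    """Sample target skills, then have an LLM write a sentence using them."""
    skills = random.sample(SKILLS, k)
    sentence = llm("Write one job-ad sentence that mentions exactly these "
                   f"skills: {', '.join(skills)}")
    return {"text": sentence, "labels": skills}   # (sentence, labels) pair

def synth_dataset(llm, n=1000):
    return [synth_example(llm) for _ in range(n)]
```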
- Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction [19.43668931500507]
We propose an end-to-end system for skill extraction, based on distant supervision through literal matching.
We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements.
We release the benchmark dataset for research purposes to stimulate further research on the task.
arXiv Detail & Related papers (2022-09-13T13:37:06Z)
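A toy sketch of distant supervision with taxonomy-based hard negatives, in the spirit of the summary above; the mini skill list and the "related skills" map are invented stand-ins for the full ESCO taxonomy:

```python
# Sketch: positives by literal matching, hard negatives from related skills.
# RELATED is a tiny invented stand-in for the ESCO taxonomy graph.
RELATED = {  # skill -> taxonomy neighbors (candidate hard negatives)
    "python": ["java", "scala"],
    "teamwork": ["leadership", "communication"],
}

def distant_labels(sentence, taxonomy):
    """Positives: skills literally matched in the sentence."""
    text = sentence.lower()
    return [s for s in taxonomy if s in text]

def sample_negatives(positives):
    """Hard negatives: related skills that were NOT matched."""
    negs = []
    for p in positives:
        negs.extend(n for n in RELATED.get(p, []) if n not in positives)
    return negs

sent = "We need strong Python skills and teamwork."
pos = distant_labels(sent, list(RELATED))   # ['python', 'teamwork']
print(pos, sample_negatives(pos))           # negatives drawn from neighbors
```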
- KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in Few-Shot NLP [68.43279384561352]
Existing data augmentation algorithms leverage task-independent rules or fine-tune general-purpose pre-trained language models.
These methods carry only trivial task-specific knowledge and are limited to yielding low-quality synthetic data for weak baselines on simple tasks.
We propose the Knowledge Mixture Data Augmentation Model (KnowDA): an encoder-decoder LM pretrained on a mixture of diverse NLP tasks.
arXiv Detail & Related papers (2022-06-21T11:34:02Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
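For context, a minimal Feature Density computation; the definition used here (unique features over total feature occurrences) is one common formulation and an assumption, since the paper above compares several linguistically-backed variants:

```python
# Minimal Feature Density estimate; the exact definition is an assumption.
from collections import Counter

def feature_density(docs, featurize=str.split):
    """FD = |unique features| / |all feature occurrences|."""
    counts = Counter(f for d in docs for f in featurize(d))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

corpus = ["you are great", "you are not great at all"]
print(feature_density(corpus))  # higher FD ~ more lexically diverse corpus
```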
- Generative Conversational Networks [67.13144697969501]
We propose a framework called Generative Conversational Networks, in which conversational agents learn to generate their own labelled training data.
We show an average improvement of 35% in intent detection and 21% in slot tagging over a baseline model trained from the seed data.
arXiv Detail & Related papers (2021-06-15T23:19:37Z)
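A compact sketch of the self-generation loop implied by the summary above: train on seed data, generate synthetic examples, filter by quality, and retrain. Every component (`generator`, `quality`, `train`) is a hypothetical stand-in for the framework's actual models:

```python
# Sketch of a generate-filter-retrain loop; all components are placeholders.
def generative_conversational_network(seed, generator, quality, train, rounds=3):
    data = list(seed)
    model = train(data)
    for _ in range(rounds):
        synthetic = generator(model, n=100)            # agent writes its own data
        data += [ex for ex in synthetic if quality(ex) > 0.5]  # keep good examples
        model = train(data)                            # retrain on enriched set
    return model
```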
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and the lack of relevant data for the future learning tasks of a trained network.
We use the available data, which may be an imbalanced subset of the original training dataset or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.