Related papers: JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching

URL: http://arxiv.org/abs/2402.03242v1
Date: Mon, 5 Feb 2024 17:57:26 GMT
Title: JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching
Authors: Antoine Magron, Anna Dai, Mike Zhang, Syrielle Montariol, Antoine Bosselut
Abstract summary: JobSkape is a framework to generate synthetic data for skill-to-taxonomy matching. Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings. We present a multi-step pipeline for skill extraction and matching tasks using large language models.
Score: 18.94748873243611
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent approaches in skill matching, employing synthetic training data for classification or similarity model training, have shown promising results, reducing the need for time-consuming and expensive annotations. However, previous synthetic datasets have limitations, such as featuring only one skill per sentence and generally comprising short sentences. In this paper, we introduce JobSkape, a framework to generate synthetic data that tackles these limitations, specifically designed to enhance skill-to-taxonomy matching. Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings tailored for skill-matching tasks. We introduce several offline metrics that show that our dataset resembles real-world data. Additionally, we present a multi-step pipeline for skill extraction and matching tasks using large language models (LLMs), benchmarking against known supervised methodologies. We outline that the downstream evaluation results on real-world data can beat baselines, underscoring its efficacy and adaptability.

Related papers

RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs [3.41612427812159]
In digital content creation tools, users express their needs through natural language queries that must be mapped to API calls. Existing approaches to synthetic data generation fail to replicate real-world data distributions. We present a novel router-based architecture that generates high-quality synthetic training data.
arXiv Detail & Related papers (2025-05-15T16:53:45Z)
Synthesize-on-Graph: Knowledgeable Synthetic Data Generation for Continue Pre-training of Large Language Models [8.299006259255572]
We propose Synthetic-on-Graph (SoG), a synthetic data generation framework that incorporates cross-document knowledge associations for efficient corpus expansion. SoG constructs a context graph by extracting entities and concepts from the original corpus, representing cross-document associations. To further improve synthetic data quality, we integrate Chain-of-Thought (CoT) and Contrastive Clarifying (CC) synthetic, enhancing reasoning processes and discriminative power.
arXiv Detail & Related papers (2025-05-02T03:40:39Z)
Generate to Discriminate: Expert Routing for Continual Learning [59.71853576559306]
Generate to Discriminate (G2D) is a continual learning method that leverages synthetic data to train a domain-discriminator. We observe that G2D outperforms competitive domain-incremental learning methods on tasks in both vision and language modalities.
arXiv Detail & Related papers (2024-12-22T13:16:28Z)
Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data [7.603659241572307]
We propose a novel UCB-based training procedure combined with a dynamic usability metric. Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets. We show that our metric is an effective way to rank synthetic images based on their usability.
arXiv Detail & Related papers (2024-12-06T23:36:36Z)
Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics [49.9329723199239]
We propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples. We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset.
arXiv Detail & Related papers (2024-10-04T13:39:21Z)
CTG-KrEW: Generating Synthetic Structured Contextually Correlated Content by Conditional Tabular GAN with K-Means Clustering and Efficient Word Embedding [12.072052949955385]
Conditional Tabular Generative Adversarial Networks (CTGAN) are attractive for their ability to efficiently create synthetic data. We introduce a novel framework, CTGKrEW, which is adept at generating realistic synthetic data where attributes are collections of semantically and contextually coherent words. CTGKrEW also takes around 99% less CPU time and has a 33% smaller memory footprint than the conventional approach.
arXiv Detail & Related papers (2024-09-03T05:53:57Z)
EPIC: Effective Prompting for Imbalanced-Class Data Synthesis in Tabular Data Classification via Large Language Models [39.347666307218006]
Large language models (LLMs) have demonstrated remarkable in-context learning capabilities across diverse applications. We introduce EPIC, a novel approach that leverages balanced, grouped data samples and consistent formatting with unique variable mapping to guide LLMs in generating accurate synthetic data across all classes, even for imbalanced datasets.
arXiv Detail & Related papers (2024-04-15T17:49:16Z)
NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets. We employ an external datastore for retrieving similar skills in a dataset-unifying manner. We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method [0.0]
This work shows the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative.
arXiv Detail & Related papers (2023-10-10T12:29:57Z)
Effective Few-Shot Named Entity Linking by Meta-Learning [34.70028855572534]
We propose a novel weak supervision strategy to generate non-trivial synthetic entity-mention pairs. We also design a meta-learning mechanism to assign different weights to each synthetic entity-mention pair automatically. Experiments on real-world datasets show that the proposed method can extensively improve the state-of-the-art few-shot entity linking model.
arXiv Detail & Related papers (2022-07-12T03:23:02Z)
Synthetic Benchmarks for Scientific Research in Explainable Machine Learning [14.172740234933215]
We release XAI-Bench: a suite of synthetic datasets and a library for benchmarking feature attribution algorithms. Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values. We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and identifying failure modes for popular explainers.
arXiv Detail & Related papers (2021-06-23T17:10:21Z)
Unsupervised Opinion Summarization with Content Planning [58.5308638148329]
We show that explicitly incorporating content planning in a summarization model yields output of higher quality. We also create synthetic datasets which are more natural, resembling real world document-summary pairs. Our approach outperforms competitive models in generating informative, coherent, and fluent summaries.
arXiv Detail & Related papers (2020-12-14T18:41:58Z)
How Useful is Self-Supervised Pretraining for Visual Tasks? [133.1984299177874]
We evaluate various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks. Our experiments offer insights into how the utility of self-supervision changes as the number of available labels grows.
arXiv Detail & Related papers (2020-03-31T16:03:22Z)
