KARRIEREWEGE: A Large Scale Career Path Prediction Dataset
- URL: http://arxiv.org/abs/2412.14612v1
- Date: Thu, 19 Dec 2024 08:02:08 GMT
- Title: KARRIEREWEGE: A Large Scale Career Path Prediction Dataset
- Authors: Elena Senger, Yuri Campbell, Rob van der Goot, Barbara Plank
- Abstract summary: We introduce KARRIEREWEGE, a comprehensive, publicly available dataset containing over 500k career paths. To tackle the problem of free-text inputs typically found in resumes, we enhance it by synthesizing job titles and descriptions. This allows for accurate predictions from unstructured data, closely aligning with real-world application challenges.
- Score: 29.24421465266904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate career path prediction can support many stakeholders, like job seekers, recruiters, HR, and project managers. However, publicly available data and tools for career path prediction are scarce. In this work, we introduce KARRIEREWEGE, a comprehensive, publicly available dataset containing over 500k career paths, significantly surpassing the size of previously available datasets. We link the dataset to the ESCO taxonomy to offer a valuable resource for predicting career trajectories. To tackle the problem of free-text inputs typically found in resumes, we enhance it by synthesizing job titles and descriptions, resulting in KARRIEREWEGE+. This allows for accurate predictions from unstructured data, closely aligning with real-world application challenges. We benchmark existing state-of-the-art (SOTA) models on our dataset and a prior benchmark and observe improved performance and robustness, particularly for free-text use cases, due to the synthesized data.
Related papers
- Towards Data-Efficient Pretraining for Atomic Property Prediction [51.660835328611626]
We show that pretraining on a task-relevant dataset can match or surpass large-scale pretraining.
We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance.
arXiv Detail & Related papers (2025-02-16T11:46:23Z)
- Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling [37.508538729757404]
WorldREP is a novel dataset designed to address limitations by leveraging the advanced reasoning capabilities of large language models (LLMs).
Our dataset features high-quality scoring labels generated through advanced prompt modeling and rigorously validated by domain experts in political science.
We publicly release our dataset along with the full automation source code for data collection, labeling, and benchmarking, aiming to support and advance research in text-based event prediction.
arXiv Detail & Related papers (2024-11-21T11:44:23Z)
- Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset [1.3757956340051605]
We present a flexible and efficient end-to-end pipeline for working with the Dynamic World dataset.
This includes a pre-processing and representation framework which tackles noise removal, efficient extraction of large amounts of data, and re-representation of LULC data.
To demonstrate the power of our pipeline, we use it to extract data for an urbanization prediction problem and build a suite of machine learning models with excellent performance.
arXiv Detail & Related papers (2024-10-11T16:13:01Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- How Much Data are Enough? Investigating Dataset Requirements for Patch-Based Brain MRI Segmentation Tasks [74.21484375019334]
Training deep neural networks reliably requires access to large-scale datasets.
To mitigate both the time and financial costs associated with model development, a clear understanding of the amount of data required to train a satisfactory model is crucial.
This paper proposes a strategic framework for estimating the amount of annotated data required to train patch-based segmentation networks.
arXiv Detail & Related papers (2024-04-04T13:55:06Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Career Path Prediction using Resume Representation Learning and Skill-based Matching [14.635764829230398]
We present a novel representation learning approach, CareerBERT, specifically designed for work history data.
We develop a skill-based model and a text-based model for career path prediction, which achieve 35.24% and 39.61% recall@10 respectively.
arXiv Detail & Related papers (2023-10-24T08:56:06Z)
- LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which includes a total of 8,600 sensors in California with a 5-year time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z)
- The Stanford Drone Dataset is More Complex than We Think: An Analysis of Key Characteristics [2.064612766965483]
We discuss the characteristics of the Stanford Drone dataset (SDD).
We demonstrate how this insufficiency reduces the information available to users and can impact performance.
Our intention is to increase the performance and methods applied to this dataset going forward, while also clearly detailing less obvious features of the dataset for new users.
arXiv Detail & Related papers (2022-03-22T13:58:14Z)
- CAREER: A Foundation Model for Labor Sequence Data [21.38386300423882]
We develop CAREER, a foundation model for job sequences.
CAREER is first fit to large, passively-collected resume data, then fine-tuned to smaller, better-curated datasets for economic inferences.
We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely-used economics datasets.
arXiv Detail & Related papers (2022-02-16T23:23:50Z)
- Injecting Knowledge in Data-driven Vehicle Trajectory Predictors [82.91398970736391]
Vehicle trajectory prediction tasks have been commonly tackled from two perspectives: knowledge-driven or data-driven.
In this paper, we propose to learn a "Realistic Residual Block" (RRB) which effectively connects these two perspectives.
Our proposed method outputs realistic predictions by confining the residual range and taking into account its uncertainty.
arXiv Detail & Related papers (2021-03-08T16:03:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.