CAREER: A Foundation Model for Labor Sequence Data
- URL: http://arxiv.org/abs/2202.08370v4
- Date: Thu, 29 Feb 2024 16:58:25 GMT
- Title: CAREER: A Foundation Model for Labor Sequence Data
- Authors: Keyon Vafa, Emil Palikot, Tianyu Du, Ayush Kanodia, Susan Athey, David M. Blei
- Abstract summary: We develop CAREER, a foundation model for job sequences.
CAREER is first fit to large, passively-collected resume data, then fine-tuned to smaller, better-curated datasets for economic inferences.
We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely-used economics datasets.
- Score: 21.38386300423882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Labor economists regularly analyze employment data by fitting predictive
models to small, carefully constructed longitudinal survey datasets. Although
machine learning methods offer promise for such problems, these survey datasets
are too small to take advantage of them. In recent years large datasets of
online resumes have also become available, providing data about the career
trajectories of millions of individuals. However, standard econometric models
cannot take advantage of their scale or incorporate them into the analysis of
survey data. To this end we develop CAREER, a foundation model for job
sequences. CAREER is first fit to large, passively-collected resume data and
then fine-tuned to smaller, better-curated datasets for economic inferences. We
fit CAREER to a dataset of 24 million job sequences from resumes, and adjust it
on small longitudinal survey datasets. We find that CAREER forms accurate
predictions of job sequences, outperforming econometric baselines on three
widely-used economics datasets. We further find that CAREER can be used to form
good predictions of other downstream variables. For example, incorporating
CAREER into a wage model provides better predictions than the econometric
models currently in use.
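The abstract describes a two-stage workflow: fit a predictive model of job sequences to a large resume corpus, then continue fitting on a small, curated survey dataset. The sketch below illustrates that pretrain-then-fine-tune pattern with a toy first-order transition model standing in for CAREER's actual architecture; the job titles and datasets are invented for illustration and are not from the paper.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences, counts=None):
    """Accumulate job-to-job transition counts from a corpus of career sequences."""
    counts = counts if counts is not None else defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, job):
    """Return the most likely next job given the accumulated counts."""
    return counts[job].most_common(1)[0][0]

# "Pretraining": a large, passively collected resume corpus (toy stand-in).
resumes = [["clerk", "analyst", "manager"],
           ["clerk", "analyst", "analyst"],
           ["teacher", "principal"]] * 1000
counts = fit_transitions(resumes)

# "Fine-tuning": continue fitting on a small, curated survey dataset,
# so survey evidence adjusts, rather than replaces, the pretrained model.
survey = [["clerk", "teacher"], ["analyst", "manager"]]
counts = fit_transitions(survey, counts)

print(predict_next(counts, "clerk"))  # prints "analyst"
```

In the paper the pretrained component is a transformer and fine-tuning updates its parameters rather than raw counts, but the division of labor is the same: scale comes from the resume data, and the survey data adapts the model for economic inference.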
Related papers
- Towards Data-Efficient Pretraining for Atomic Property Prediction [51.660835328611626]
We show that pretraining on a task-relevant dataset can match or surpass large-scale pretraining.
We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance.
arXiv Detail & Related papers (2025-02-16T11:46:23Z) - Predicting Large Language Model Capabilities on Closed-Book QA Tasks Using Only Information Available Prior to Training [51.60874286674908]
We focus on predicting performance on Closed-book Question Answering (CBQA) tasks, which are closely tied to pre-training data and knowledge retention.
We address three major challenges: 1) mastering the entire pre-training process, especially data construction; 2) evaluating a model's knowledge retention; and 3) predicting task-specific knowledge retention using only information available prior to training.
We introduce the SMI metric, an information-theoretic measure that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention.
arXiv Detail & Related papers (2025-02-06T13:23:53Z) - KARRIEREWEGE: A Large Scale Career Path Prediction Dataset [29.24421465266904]
We introduce KARRIEREWEGE, a comprehensive, publicly available dataset containing over 500k career paths.
To tackle the problem of free-text inputs typically found in resumes, we enhance it by synthesizing job titles and descriptions.
This allows for accurate predictions from unstructured data, closely aligning with real-world application challenges.
arXiv Detail & Related papers (2024-12-19T08:02:08Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Enriching Datasets with Demographics through Large Language Models: What's in a Name? [5.871504332441324]
Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data.
We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong.
arXiv Detail & Related papers (2024-09-17T18:40:49Z) - Estimating Wage Disparities Using Foundation Models [20.740346109417143]
We develop methods for fine-tuning foundation models to perform estimation problems.
To demonstrate our ideas, we study gender wage decomposition.
We use a custom-built foundation model to decompose the gender wage gap.
arXiv Detail & Related papers (2024-09-15T23:22:21Z) - DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks.
This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z) - Evaluating Pre-Training Bias on Severe Acute Respiratory Syndrome Dataset [0.0]
This work uses the severe acute respiratory syndrome dataset from OpenDataSUS to visualize three pre-training bias metrics.
The aim is to compare the bias for the different regions, focusing on their protected attributes and comparing the model's performance with the metric values.
arXiv Detail & Related papers (2024-08-27T20:49:11Z) - LABOR-LLM: Language-Based Occupational Representations with Large Language Models [8.909328013944567]
This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs.
We show that our fine-tuned LLM-based models' predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER.
arXiv Detail & Related papers (2024-06-25T23:07:18Z) - Graphical vs. Deep Generative Models: Measuring the Impact of Differentially Private Mechanisms and Budgets on Utility [18.213030598476198]
We compare graphical and deep generative models, focusing on the key factors contributing to how privacy budgets are spent.
We find that graphical models distribute privacy budgets horizontally and thus cannot handle relatively wide datasets for a fixed training time.
Deep generative models spend their budgets per iteration, so their behavior is less predictable with varying dataset dimensions.
arXiv Detail & Related papers (2023-05-18T14:14:42Z) - Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z) - Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
arXiv Detail & Related papers (2022-02-01T18:15:24Z) - When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z) - REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets [64.76453161039973]
REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset.
It surfaces potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based.
arXiv Detail & Related papers (2020-04-16T23:54:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.