CAREER: A Foundation Model for Labor Sequence Data
- URL: http://arxiv.org/abs/2202.08370v4
- Date: Thu, 29 Feb 2024 16:58:25 GMT
- Title: CAREER: A Foundation Model for Labor Sequence Data
- Authors: Keyon Vafa, Emil Palikot, Tianyu Du, Ayush Kanodia, Susan Athey, David M. Blei
- Abstract summary: We develop CAREER, a foundation model for job sequences.
CAREER is first fit to large, passively-collected resume data, then fine-tuned to smaller, better-curated datasets for economic inferences.
We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely-used economics datasets.
- Score: 21.38386300423882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Labor economists regularly analyze employment data by fitting predictive
models to small, carefully constructed longitudinal survey datasets. Although
machine learning methods offer promise for such problems, these survey datasets
are too small to take advantage of them. In recent years large datasets of
online resumes have also become available, providing data about the career
trajectories of millions of individuals. However, standard econometric models
cannot take advantage of their scale or incorporate them into the analysis of
survey data. To this end we develop CAREER, a foundation model for job
sequences. CAREER is first fit to large, passively-collected resume data and
then fine-tuned to smaller, better-curated datasets for economic inferences. We
fit CAREER to a dataset of 24 million job sequences from resumes, and adjust it
on small longitudinal survey datasets. We find that CAREER forms accurate
predictions of job sequences, outperforming econometric baselines on three
widely-used economics datasets. We further find that CAREER can be used to form
good predictions of other downstream variables. For example, incorporating
CAREER into a wage model provides better predictions than the econometric
models currently in use.
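The abstract describes a two-stage workflow: fit a predictive model of job sequences to a large resume corpus, then continue fitting on a small, curated survey dataset. The sketch below illustrates that pretrain-then-fine-tune pattern with a toy first-order transition model standing in for CAREER's actual architecture; the job titles and datasets are invented for illustration and are not from the paper.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences, counts=None):
    """Accumulate job-to-job transition counts from a corpus of career sequences."""
    counts = counts if counts is not None else defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, job):
    """Return the most likely next job given the accumulated counts."""
    return counts[job].most_common(1)[0][0]

# "Pretraining": a large, passively collected resume corpus (toy stand-in).
resumes = [["clerk", "analyst", "manager"],
           ["clerk", "analyst", "analyst"],
           ["teacher", "principal"]] * 1000
counts = fit_transitions(resumes)

# "Fine-tuning": continue fitting on a small, curated survey dataset,
# so survey evidence adjusts, rather than replaces, the pretrained model.
survey = [["clerk", "teacher"], ["analyst", "manager"]]
counts = fit_transitions(survey, counts)

print(predict_next(counts, "clerk"))  # prints "analyst"
```

In the paper the pretrained component is a transformer and fine-tuning updates its parameters rather than raw counts, but the division of labor is the same: scale comes from the resume data, and the survey data adapts the model for economic inference.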
Related papers
- Towards Data-Efficient Pretraining for Atomic Property Prediction [51.660835328611626]
We show that pretraining on a task-relevant dataset can match or surpass large-scale pretraining.
We introduce the Chemical Similarity Index (CSI), a novel metric inspired by computer vision's Fréchet Inception Distance.
arXiv Detail & Related papers (2025-02-16T11:46:23Z) - Predicting Large Language Model Capabilities on Closed-Book QA Tasks Using Only Information Available Prior to Training [51.60874286674908]
We focus on predicting performance on Closed-book Question Answering (CBQA) tasks, which are closely tied to pre-training data and knowledge retention.
We address three major challenges: 1) mastering the entire pre-training process, especially data construction; 2) evaluating a model's knowledge retention; and 3) predicting task-specific knowledge retention using only information available prior to training.
We introduce the SMI metric, an information-theoretic measure that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention.
arXiv Detail & Related papers (2025-02-06T13:23:53Z) - KARRIEREWEGE: A Large Scale Career Path Prediction Dataset [29.24421465266904]
We introduce KARRIEREWEGE, a comprehensive, publicly available dataset containing over 500k career paths.
To tackle the problem of free-text inputs typically found in resumes, we enhance it by synthesizing job titles and descriptions.
This allows for accurate predictions from unstructured data, closely aligning with real-world application challenges.
arXiv Detail & Related papers (2024-12-19T08:02:08Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - Enriching Datasets with Demographics through Large Language Models: What's in a Name? [5.871504332441324]
Large Language Models (LLMs) can perform as well as, if not better than, bespoke models trained on specialized data.
We apply these LLMs to a variety of datasets, including a real-life, unlabelled dataset of licensed financial professionals in Hong Kong.
arXiv Detail & Related papers (2024-09-17T18:40:49Z) - Estimating Wage Disparities Using Foundation Models [20.740346109417143]
We develop methods for fine-tuning foundation models to perform estimation problems.
To demonstrate our ideas, we study gender wage decomposition.
We use a custom-built foundation model to decompose the gender wage gap.
arXiv Detail & Related papers (2024-09-15T23:22:21Z) - DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks.
This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions.
Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z) - Evaluating Pre-Training Bias on Severe Acute Respiratory Syndrome Dataset [0.0]
This work uses the severe acute respiratory syndrome dataset from OpenDataSUS to visualize three pre-training bias metrics.
The aim is to compare the bias for the different regions, focusing on their protected attributes and comparing the model's performance with the metric values.
arXiv Detail & Related papers (2024-08-27T20:49:11Z) - LABOR-LLM: Language-Based Occupational Representations with Large Language Models [8.909328013944567]
This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs.
We show that our fine-tuned LLM-based models' predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER.
arXiv Detail & Related papers (2024-06-25T23:07:18Z) - Graphical vs. Deep Generative Models: Measuring the Impact of Differentially Private Mechanisms and Budgets on Utility [18.213030598476198]
We compare graphical and deep generative models, focusing on the key factors contributing to how privacy budgets are spent.
We find that graphical models distribute privacy budgets horizontally and thus cannot handle relatively wide datasets for a fixed training time.
Deep generative models spend their budgets per iteration, so their behavior is less predictable with varying dataset dimensions.
arXiv Detail & Related papers (2023-05-18T14:14:42Z) - Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z) - Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
arXiv Detail & Related papers (2022-02-01T18:15:24Z) - When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z) - REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets [64.76453161039973]
REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset.
It surfaces potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based.
arXiv Detail & Related papers (2020-04-16T23:54:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.