A Large-scale Industrial and Professional Occupation Dataset
- URL: http://arxiv.org/abs/2005.02780v1
- Date: Sat, 25 Apr 2020 10:45:48 GMT
- Title: A Large-scale Industrial and Professional Occupation Dataset
- Authors: Junhua Liu, Yung Chuen Ng and Kwan Hui Lim
- Abstract summary: In today's job market, occupational data mining and analysis is growing in importance.
This dataset comprises 192k job titles belonging to 56k LinkedIn users.
- Score: 0.2642698101441705
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been growing interest in utilizing occupational data mining and
analysis. In today's job market, occupational data mining and analysis is
growing in importance as it enables companies to predict employee turnover,
model career trajectories, screen through resumes and perform other human
resource tasks. A key requirement to facilitate these tasks is the need for an
occupation-related dataset. However, most research uses proprietary datasets or
does not make its datasets publicly available, thus impeding development in this
area. To solve this issue, we present the Industrial and Professional
Occupation Dataset (IPOD), which comprises 192k job titles belonging to 56k
LinkedIn users. In addition to making IPOD publicly available, we also: (i)
manually annotate each job title with its associated level of seniority, domain
of work and location; and (ii) provide embeddings for job titles and discuss
various use cases. This dataset is publicly available at
https://github.com/junhua/ipod.
Related papers
- JobHop: A Large-Scale Dataset of Career Trajectories [48.881023210777585]
JobHop is a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium.
We process unstructured resume data to extract structured career information, which is then mapped to standardized ESCO occupation codes.
This results in a rich dataset of over 2.3 million work experiences, extracted from and grouped into more than 391,000 user resumes.
arXiv Detail & Related papers (2025-05-12T15:22:29Z)
- KARRIEREWEGE: A Large Scale Career Path Prediction Dataset [29.24421465266904]
We introduce KARRIEREWEGE, a comprehensive, publicly available dataset containing over 500k career paths.
To tackle the problem of free-text inputs typically found in resumes, we enhance it by synthesizing job titles and descriptions.
This allows for accurate predictions from unstructured data, closely aligning with real-world application challenges.
arXiv Detail & Related papers (2024-12-19T08:02:08Z)
- RedStone: Curating General, Code, Math, and QA Data for Large Language Models [134.49774529790693]
This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training Large Language Models.
We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl.
arXiv Detail & Related papers (2024-12-04T15:27:39Z)
- Job-SDF: A Multi-Granularity Dataset for Job Skill Demand Forecasting and Benchmarking [59.87055275344965]
Job-SDF is a dataset designed to train and benchmark job-skill demand forecasting models.
It is based on 10.35 million public job advertisements collected from major online recruitment platforms in China between 2021 and 2023.
Our dataset uniquely enables evaluating skill demand forecasting models at various granularities, including occupation, company, and regional levels.
arXiv Detail & Related papers (2024-06-17T07:22:51Z)
- NNOSE: Nearest Neighbor Occupational Skill Extraction [55.22292957778972]
We tackle the complexity in occupational skill datasets.
We employ an external datastore for retrieving similar skills in a dataset-unifying manner.
We observe a performance gain in predicting infrequent patterns, with substantial gains of up to 30% span-F1 in cross-dataset settings.
arXiv Detail & Related papers (2024-01-30T15:18:29Z)
- Unearthing Large Scale Domain-Specific Knowledge from Public Corpora [103.0865116794534]
We introduce large models into the data collection pipeline to guide the generation of domain-specific information.
We refer to this approach as Retrieve-from-CC.
It not only collects data related to domain-specific knowledge but also mines the data containing potential reasoning procedures from the public corpus.
arXiv Detail & Related papers (2024-01-26T03:38:23Z)
- Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook [95.32949323258251]
Temporal data, notably time series and spatio-temporal data, are prevalent in real-world applications.
Recent advances in large language and other foundational models have spurred their increased use in time series and spatio-temporal data mining.
arXiv Detail & Related papers (2023-10-16T09:06:00Z)
- A practical method for occupational skills detection in Vietnamese job listings [0.16114012813668932]
Lack of accurate and timely labor market information leads to skill mismatches.
Traditional approaches rely on existing taxonomy and/or large annotated data.
We propose a practical methodology for skill detection in Vietnamese job listings.
arXiv Detail & Related papers (2022-10-26T10:23:18Z)
- CAREER: A Foundation Model for Labor Sequence Data [21.38386300423882]
We develop CAREER, a foundation model for job sequences.
CAREER is first fit to large, passively-collected resume data, then fine-tuned to smaller, better-curated datasets for economic inferences.
We find that CAREER forms accurate predictions of job sequences, outperforming econometric baselines on three widely-used economics datasets.
arXiv Detail & Related papers (2022-02-16T23:23:50Z)
- Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets [122.85598648289789]
We study how multi-domain and multi-task datasets can improve the learning of new tasks in new environments.
We also find that data for only a few tasks in a new domain can bridge the domain gap and make it possible for a robot to perform a variety of prior tasks that were only seen in other domains.
arXiv Detail & Related papers (2021-09-27T23:42:12Z)
- Toward Knowledge Discovery Framework for Data Science Job Market in the United States [1.7205106391379024]
This paper introduces a framework to analyze the job market for data science-related jobs within the US.
The proposed framework includes three sub-modules allowing continuous data collection, information extraction, and a web-based visualization dashboard.
The current version of this application is deployed on the web and allows individuals and institutes to investigate skills required for data science positions.
arXiv Detail & Related papers (2021-06-14T21:23:15Z)
- Job2Vec: Job Title Benchmarking with Collective Multi-View Representation Learning [51.34011135329063]
Job Title Benchmarking (JTB) aims at matching job titles with similar expertise levels across various companies.
Traditional JTB approaches mainly rely on manual market surveys, which are expensive and labor-intensive.
We reformulate JTB as a link prediction task over the Job-Graph, in which matched job titles should be linked.
arXiv Detail & Related papers (2020-09-16T02:33:32Z)
- Data science on industrial data -- Today's challenges in brown field applications [0.0]
This paper shows the state of the art and what to expect when working with stock machines in the field.
A major focus in this paper is on data collection which can be more cumbersome than most people might expect.
Data quality for machine learning applications is a challenge once leaving the laboratory.
arXiv Detail & Related papers (2020-06-10T10:05:16Z)
- DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.