Construction of English Resume Corpus and Test with Pre-trained Language
Models
- URL: http://arxiv.org/abs/2208.03219v1
- Date: Fri, 5 Aug 2022 15:07:23 GMT
- Title: Construction of English Resume Corpus and Test with Pre-trained Language
Models
- Authors: Chengguang Gan, Tatsunori Mori
- Abstract summary: This study aims to transform the information extraction task of resumes into a simple sentence classification task.
The classification rules are improved to create a larger and more fine-grained classification dataset of resumes.
This corpus is also used to test the performance of several current mainstream pre-trained language models (PLMs).
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Information extraction (IE) has always been one of the essential tasks of NLP,
and one of its most critical application scenarios is the extraction of information from
resumes. Structured text is obtained by classifying each part of a resume, which makes it
convenient to store the text for later search and analysis. Furthermore, the structured
resume data can also be used in AI resume-screening systems, significantly reducing the
labor cost of HR. This study aims to transform the information extraction task of resumes
into a simple sentence classification task. Based on the English resume dataset produced
by a prior study, the classification rules are improved to create a larger and more
fine-grained classification dataset of resumes. This corpus is also used to test the
performance of several current mainstream pre-trained language models (PLMs). Furthermore,
in order to explore the relationship between the number of training samples and
classification accuracy on the resume dataset, we also performed comparison experiments
with training sets of different sizes. The final experimental results show that the dataset
with improved annotation rules and an increased number of samples yields higher accuracy
than the original resume dataset.
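As a concrete illustration of recasting resume information extraction as sentence classification, the sketch below fine-tunes one pre-trained language model on individual resume lines with Hugging Face Transformers. This is a minimal, hypothetical example: the model choice (bert-base-uncased), the label set, and the sample sentences are assumptions for illustration, not the paper's actual annotation scheme or experimental setup.

```python
# Minimal sketch: resume IE recast as sentence classification with a PLM.
# The label set, example lines, and model choice are illustrative assumptions.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["experience", "education", "skills", "other"]  # hypothetical label set

train_texts = [
    "Software engineer at Acme Corp, 2018-2022.",
    "B.Sc. in Computer Science, University of Example.",
    "Proficient in Python, SQL, and Docker.",
]
train_labels = [0, 1, 2]  # indices into LABELS

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

class ResumeLines(torch.utils.data.Dataset):
    """Wraps tokenized resume lines and their section labels."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=128)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="resume-cls", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ResumeLines(train_texts, train_labels),
)
trainer.train()
```

The same loop could be repeated with progressively larger subsets of the training data to reproduce the kind of train-set-size comparison the abstract describes.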
Related papers
- Summarization-based Data Augmentation for Document Classification [16.49709049899731]
We propose a simple yet effective summarization-based data augmentation, SUMMaug, for document classification.
We first obtain easy-to-learn examples for the target document classification task.
We then use the generated pseudo examples to perform curriculum learning.
arXiv Detail & Related papers (2023-12-01T11:34:37Z)
- Unified Pretraining for Recommendation via Task Hypergraphs [55.98773629788986]
We propose a novel multitask pretraining framework named Unified Pretraining for Recommendation via Task Hypergraphs.
For a unified learning pattern to handle diverse requirements and nuances of various pretext tasks, we design task hypergraphs to generalize pretext tasks to hyperedge prediction.
A novel transitional attention layer is devised to discriminatively learn the relevance between each pretext task and recommendation.
arXiv Detail & Related papers (2023-10-20T05:33:21Z)
- Resume Information Extraction via Post-OCR Text Processing [0.0]
The aim is to extract information by classifying all of the text groups after pre-processing steps such as Optical Character Recognition.
The text dataset consists of 286 resumes collected for 5 different job descriptions in the IT industry.
The dataset created for object recognition consists of 1198 resumes, which were collected from the open-source internet and labeled as sets of text.
arXiv Detail & Related papers (2023-06-23T20:14:07Z)
- Abstractive Text Summarization for Resumes With Cutting Edge NLP Transformers and LSTM [0.0]
LSTM, pre-trained models, and fine-tuned models were assessed using a dataset of resumes.
The BART-Large model fine-tuned with the resume dataset gave the best performance.
arXiv Detail & Related papers (2023-06-23T06:33:20Z)
- Zero-Shot Listwise Document Reranking with a Large Language Model [58.64141622176841]
We propose Listwise Reranker with a Large Language Model (LRL), which achieves strong reranking effectiveness without using any task-specific training data.
Experiments on three TREC web search datasets demonstrate that LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also act as a final-stage reranker.
arXiv Detail & Related papers (2023-05-03T14:45:34Z)
- Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset.
The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- Zero-Shot Information Extraction as a Unified Text-to-Triple Translation [56.01830747416606]
We cast a suite of information extraction tasks into a text-to-triple translation framework.
We formalize the task as a translation between task-specific input text and output triples.
We study the zero-shot performance of this framework on open information extraction.
arXiv Detail & Related papers (2021-09-23T06:54:19Z)
- Back-Translated Task Adaptive Pretraining: Improving Accuracy and Robustness on Text Classification [5.420446976940825]
We propose a back-translated task-adaptive pretraining (BT-TAPT) method that increases the amount of task-specific data for LM re-pretraining.
The experimental results show that the proposed BT-TAPT yields improved classification accuracy on both low- and high-resource data and better robustness to noise than the conventional adaptive pretraining method.
arXiv Detail & Related papers (2021-07-22T06:27:35Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
To tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)