conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and
Job Seekers
- URL: http://arxiv.org/abs/2109.06501v1
- Date: Tue, 14 Sep 2021 07:57:05 GMT
- Title: conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and
Job Seekers
- Authors: Dor Lavi, Volodymyr Medentsiy, David Graus
- Abstract summary: We explain our task, where noisy data from parsed resumes, the heterogeneous nature of the different data sources, and cross-linguality and multilinguality present domain-specific challenges.
We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT) model, which we call conSultantBERT, using a large-scale, real-world, and high-quality dataset of over 270,000 resume-vacancy pairs labeled by our staffing consultants.
We show how our fine-tuned model significantly outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted feature vectors and BERT embeddings.
- Score: 2.208694022993555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we focus on constructing useful embeddings of the
textual information in vacancies and resumes, which we aim to incorporate as
features into job-to-job-seeker matching models alongside other features. We
explain our task, where noisy data from parsed resumes, the heterogeneous
nature of the different data sources, and cross-linguality and multilinguality
present domain-specific challenges.
We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT)
model, which we call conSultantBERT, using a large-scale, real-world, and
high-quality dataset of over 270,000 resume-vacancy pairs labeled by our
staffing consultants. We show how our fine-tuned model significantly
outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted
feature vectors and BERT embeddings. In addition, we find that our model
successfully matches cross-lingual and multilingual textual content.
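To make the fine-tuning setup concrete, here is a minimal sketch (not the authors' code) of how a Siamese Sentence-BERT bi-encoder can be fine-tuned on labeled resume-vacancy pairs with the sentence-transformers library. The multilingual base checkpoint, the toy training pairs, and the choice of cosine-similarity loss are illustrative assumptions; the paper's exact architecture and training configuration may differ.

```python
# Minimal sketch (not the authors' implementation): fine-tuning a Siamese
# SBERT-style bi-encoder on labeled resume-vacancy pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# Assumption: a multilingual BERT backbone with mean pooling; the paper's
# actual base model and pooling strategy may differ.
word_embedding_model = models.Transformer("bert-base-multilingual-cased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Hypothetical consultant labels: 1.0 = resume matches the vacancy, 0.0 = no match.
train_examples = [
    InputExample(texts=["resume text ...", "vacancy text ..."], label=1.0),
    InputExample(texts=["another resume ...", "unrelated vacancy ..."], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Regressing the cosine similarity of the two embeddings onto the label is one
# standard objective for Siamese SBERT fine-tuning.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

# Inference: embed resumes and vacancies independently, rank by cosine similarity.
resume_emb = model.encode("resume text ...", convert_to_tensor=True)
vacancy_emb = model.encode("vacancy text ...", convert_to_tensor=True)
print(util.cos_sim(resume_emb, vacancy_emb).item())
```

Because the two encoders share weights, resume and vacancy embeddings can be precomputed independently and compared with a cheap cosine similarity, which is what makes a bi-encoder setup practical for matching over large candidate pools.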
Related papers
- Multi-Task Learning for Front-End Text Processing in TTS [15.62497569424995]
We propose a multi-task learning (MTL) model for jointly performing three tasks that are commonly solved in a text-to-speech front-end.
Our framework utilizes a tree-like structure with a trunk that learns shared representations, followed by separate task-specific heads.
arXiv Detail & Related papers (2024-01-12T02:13:21Z)
- ToddlerBERTa: Exploiting BabyBERTa for Grammar Learning and Language Understanding [0.0]
We present ToddlerBERTa, a BabyBERTa-like language model, exploring its capabilities through five different models.
We find that smaller models can excel in specific tasks, while larger models perform well with substantial data.
ToddlerBERTa demonstrates commendable performance, rivalling the state-of-the-art RoBERTa-base.
arXiv Detail & Related papers (2023-08-30T21:56:36Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese [4.941630596191806]
We propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese.
These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions.
arXiv Detail & Related papers (2023-05-23T13:49:14Z)
- FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Claim Matching Beyond English to Scale Global Fact-Checking [5.836354423653351]
We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims.
Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages.
We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low- and high-resource languages.
arXiv Detail & Related papers (2021-06-01T23:28:05Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
- Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z)
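The last entry above describes fine-tuning multilingual BERT as a relevance matcher between English queries and foreign-language documents. Below is a hypothetical sketch of such a cross-encoder relevance scorer; the checkpoint name, example texts, and two-class head are assumptions, and the weakly supervised fine-tuning step itself is not shown.

```python
# Hypothetical sketch of a BERT-based cross-lingual relevance matcher: a
# multilingual BERT cross-encoder that scores (English query, foreign document)
# pairs. In practice the classification head would first be fine-tuned on
# weakly supervised relevance labels, as the related paper describes.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # not relevant / relevant
)
model.eval()

query = "renewable energy policy"                     # short English query
document = "Atsinaujinanti energetika Lietuvoje ..."  # e.g. a Lithuanian document

inputs = tokenizer(query, document, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
relevance = torch.softmax(logits, dim=-1)[0, 1].item()  # probability of "relevant"
print(relevance)
```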