conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and
Job Seekers
- URL: http://arxiv.org/abs/2109.06501v1
- Date: Tue, 14 Sep 2021 07:57:05 GMT
- Title: conSultantBERT: Fine-tuned Siamese Sentence-BERT for Matching Jobs and
Job Seekers
- Authors: Dor Lavi, Volodymyr Medentsiy, David Graus
- Abstract summary: We explain our task, where noisy data from parsed resumes, the heterogeneous nature of the different data sources, and cross-linguality and multilinguality present domain-specific challenges.
We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT) model, which we call conSultantBERT, using a large-scale, real-world, and high-quality dataset of over 270,000 resume-vacancy pairs labeled by our staffing consultants.
We show how our fine-tuned model significantly outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted feature vectors and BERT embeddings.
- Score: 2.208694022993555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper we focus on constructing useful embeddings of the
textual information in vacancies and resumes, which we aim to incorporate as
features into job-to-job-seeker matching models alongside other features. We
explain our task, where noisy data from parsed resumes, the heterogeneous
nature of the different data sources, and cross-linguality and multilinguality
present domain-specific challenges.
We address these challenges by fine-tuning a Siamese Sentence-BERT (SBERT)
model, which we call conSultantBERT, using a large-scale, real-world, and
high-quality dataset of over 270,000 resume-vacancy pairs labeled by our
staffing consultants. We show how our fine-tuned model significantly
outperforms unsupervised and supervised baselines that rely on TF-IDF-weighted
feature vectors and BERT embeddings. In addition, we find that our model
successfully matches cross-lingual and multilingual textual content.
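To make the fine-tuning setup concrete, here is a minimal sketch (not the authors' code) of how a Siamese Sentence-BERT bi-encoder can be fine-tuned on labeled resume-vacancy pairs with the sentence-transformers library. The multilingual base checkpoint, the toy training pairs, and the choice of cosine-similarity loss are illustrative assumptions; the paper's exact architecture and training configuration may differ.

```python
# Minimal sketch (not the authors' implementation): fine-tuning a Siamese
# SBERT-style bi-encoder on labeled resume-vacancy pairs.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util

# Assumption: a multilingual BERT backbone with mean pooling; the paper's
# actual base model and pooling strategy may differ.
word_embedding_model = models.Transformer("bert-base-multilingual-cased", max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Hypothetical consultant labels: 1.0 = resume matches the vacancy, 0.0 = no match.
train_examples = [
    InputExample(texts=["resume text ...", "vacancy text ..."], label=1.0),
    InputExample(texts=["another resume ...", "unrelated vacancy ..."], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Regressing the cosine similarity of the two embeddings onto the label is one
# standard objective for Siamese SBERT fine-tuning.
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

# Inference: embed resumes and vacancies independently, rank by cosine similarity.
resume_emb = model.encode("resume text ...", convert_to_tensor=True)
vacancy_emb = model.encode("vacancy text ...", convert_to_tensor=True)
print(util.cos_sim(resume_emb, vacancy_emb).item())
```

Because the two encoders share weights, resume and vacancy embeddings can be precomputed independently and compared with a cheap cosine similarity, which is what makes a bi-encoder setup practical for matching over large candidate pools.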
Related papers
- Multi-Task Learning for Front-End Text Processing in TTS [15.62497569424995]
We propose a multi-task learning (MTL) model for jointly performing three tasks that are commonly solved in a text-to-speech front-end.
Our framework utilizes a tree-like structure with a trunk that learns shared representations, followed by separate task-specific heads.
arXiv Detail & Related papers (2024-01-12T02:13:21Z)
- ToddlerBERTa: Exploiting BabyBERTa for Grammar Learning and Language Understanding [0.0]
We present ToddlerBERTa, a BabyBERTa-like language model, exploring its capabilities through five different models.
We find that smaller models can excel in specific tasks, while larger models perform well with substantial data.
ToddlerBERTa demonstrates commendable performance, rivalling the state-of-the-art RoBERTa-base.
arXiv Detail & Related papers (2023-08-30T21:56:36Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Assessing Linguistic Generalisation in Language Models: A Dataset for Brazilian Portuguese [4.941630596191806]
We propose a set of intrinsic evaluation tasks that inspect the linguistic information encoded in models developed for Brazilian Portuguese.
These tasks are designed to evaluate how different language models generalise information related to grammatical structures and multiword expressions.
arXiv Detail & Related papers (2023-05-23T13:49:14Z)
- FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest English offensive language identification corpus available with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Claim Matching Beyond English to Scale Global Fact-Checking [5.836354423653351]
We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims.
Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages.
We train our own embedding model using knowledge distillation and a high-quality "teacher" model in order to address the imbalance in embedding quality between the low- and high-resource languages.
arXiv Detail & Related papers (2021-06-01T23:28:05Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
- CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
arXiv Detail & Related papers (2020-06-11T13:15:59Z)
- TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data [113.29476656550342]
We present TaBERT, a pretrained LM that jointly learns representations for NL sentences and tables.
TaBERT is trained on a large corpus of 26 million tables and their English contexts.
Implementation of the model will be available at http://fburl.com/TaBERT.
arXiv Detail & Related papers (2020-05-17T17:26:40Z)
- Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z)
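The last entry above describes fine-tuning multilingual BERT as a relevance matcher between English queries and foreign-language documents. Below is a hypothetical sketch of such a cross-encoder relevance scorer; the checkpoint name, example texts, and two-class head are assumptions, and the weakly supervised fine-tuning step itself is not shown.

```python
# Hypothetical sketch of a BERT-based cross-lingual relevance matcher: a
# multilingual BERT cross-encoder that scores (English query, foreign document)
# pairs. In practice the classification head would first be fine-tuned on
# weakly supervised relevance labels, as the related paper describes.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=2  # not relevant / relevant
)
model.eval()

query = "renewable energy policy"                     # short English query
document = "Atsinaujinanti energetika Lietuvoje ..."  # e.g. a Lithuanian document

inputs = tokenizer(query, document, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
relevance = torch.softmax(logits, dim=-1)[0, 1].item()  # probability of "relevant"
print(relevance)
```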