De-identification of Privacy-related Entities in Job Postings
- URL: http://arxiv.org/abs/2105.11223v1
- Date: Mon, 24 May 2021 12:01:22 GMT
- Title: De-identification of Privacy-related Entities in Job Postings
- Authors: Kristian Nørgaard Jensen, Mike Zhang, Barbara Plank
- Abstract summary: De-identification is the task of detecting privacy-related entities in text, such as person names, emails and contact data.
We present JobStack, a new corpus for de-identification of personal data in job vacancies on Stackoverflow.
- Score: 10.751883216434717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: De-identification is the task of detecting privacy-related entities in text,
such as person names, emails and contact data. It has been well-studied within
the medical domain. The need for de-identification technology is increasing, as
privacy-preserving data handling is in high demand in many domains. In this
paper, we focus on job postings. We present JobStack, a new corpus for
de-identification of personal data in job vacancies on Stackoverflow. We
introduce baselines, comparing Long Short-Term Memory (LSTM) and Transformer
models. To improve upon these baselines, we experiment with contextualized
embeddings and distantly related auxiliary data via multi-task learning. Our
results show that auxiliary data improves de-identification performance.
Surprisingly, vanilla BERT turned out to be more effective than a BERT model
trained on other portions of Stackoverflow.
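For a concrete picture of the baselines described above, here is a minimal sketch, assuming a PyTorch/HuggingFace setup: a shared Transformer encoder with a token-classification head for de-identification tags and a second head for a distantly related auxiliary tagging task, mirroring the multi-task setup the abstract mentions. The tag sets, model name, and head design are illustrative assumptions, not the exact JobStack configuration.

```python
# Minimal sketch of a Transformer baseline for de-identification as token
# classification, with an auxiliary head for multi-task learning.
# The tag sets below are hypothetical, not the JobStack annotation scheme.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

DEID_TAGS = ["O", "B-NAME", "I-NAME", "B-CONTACT", "I-CONTACT"]  # illustrative
AUX_TAGS = ["O", "B-ENT", "I-ENT"]  # hypothetical auxiliary NER task

class MultiTaskDeidTagger(nn.Module):
    def __init__(self, encoder_name="bert-base-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.deid_head = nn.Linear(hidden, len(DEID_TAGS))  # main task
        self.aux_head = nn.Linear(hidden, len(AUX_TAGS))    # auxiliary task

    def forward(self, input_ids, attention_mask, task="deid"):
        # Shared contextualized token representations feed both heads.
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        head = self.deid_head if task == "deid" else self.aux_head
        return head(states)  # (batch, seq_len, num_tags)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = MultiTaskDeidTagger()
batch = tokenizer(["Contact Jane Doe at jane@example.com"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"], task="deid")
pred_tags = [DEID_TAGS[i] for i in logits.argmax(-1)[0].tolist()]
```

In training, batches from the main and auxiliary datasets would be alternated, with a cross-entropy tagging loss backpropagated through the shared encoder; that shared signal is one plausible mechanism by which distantly related auxiliary data improves the main de-identification task.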
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)

- AspirinSum: an Aspect-based utility-preserved de-identification Summarization framework [1.9489823192518083]
The goal of this proposal is to develop a text de-identification framework that can be easily adapted to a specific domain.
We propose AspirinSum, an aspect-based utility-preserving de-identification summarization framework that learns to align experts' aspects from existing comment data.
We envision that the de-identified text can then be used for data publishing, and we eventually plan to publish our de-identified dataset for downstream use.
arXiv Detail & Related papers (2024-06-20T02:29:46Z)

- TAROT: A Hierarchical Framework with Multitask Co-Pretraining on Semi-Structured Data towards Effective Person-Job Fit [60.31175803899285]
We propose TAROT, a hierarchical multitask co-pretraining framework, to better utilize structural and semantic information for informative text embeddings.
TAROT targets semi-structured text in profiles and jobs, and is co-pretrained with multi-grained pretraining tasks to constrain the acquired semantic information at each level.
arXiv Detail & Related papers (2024-01-15T07:57:58Z)

- Data-Driven but Privacy-Conscious: Pedestrian Dataset De-identification via Full-Body Person Synthesis [16.394031759681678]
We motivate and introduce the Pedestrian Dataset De-Identification (PDI) task.
PDI evaluates the degree of de-identification and the downstream task training performance for a given de-identification method.
We show how our data is able to narrow the synthetic-to-real performance gap in a privacy-conscious manner.
arXiv Detail & Related papers (2023-06-20T17:39:24Z)

- Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study [68.75670223005716]
We find that pre-trained language models like BERT have a potential ability to learn sequentially, even without any sparse memory replay.
Our experiments reveal that BERT can actually generate high-quality representations for previously learned tasks over the long term, under extremely sparse replay or even no replay.
arXiv Detail & Related papers (2023-03-02T09:03:43Z)

- Memorization of Named Entities in Fine-tuned BERT Models [3.0177210416625115]
We investigate the extent of named entity memorization in fine-tuned BERT models.
We show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that is only pre-trained.
arXiv Detail & Related papers (2022-12-07T16:20:50Z)

- Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies [74.01792675564218]
We develop a data augmentation framework based on ensembling retriever models that capture relevant text segments from unlabeled policy documents.
To improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise-reduction filter models.
Using our augmented data on the PrivacyQA benchmark, we improve on the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%.
arXiv Detail & Related papers (2022-04-19T15:45:23Z)

- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained increasing attention due to its widespread applications in video surveillance.
Unfortunately, mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to annotate them simultaneously.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)

- TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework for Deep Learning with Anonymized Intermediate Representations [49.20701800683092]
We present TIPRDC, a task-independent privacy-respecting data crowdsourcing framework with anonymized intermediate representations.
The goal of this framework is to learn a feature extractor that hides the privacy information in the intermediate representations, while maximally retaining the original information embedded in the raw data so that the data collector can accomplish unknown learning tasks.
arXiv Detail & Related papers (2020-05-23T06:21:26Z)

- Sensitive Data Detection and Classification in Spanish Clinical Text: Experiments with BERT [0.8379286663107844]
In this paper, we use a BERT-based sequence labelling model to conduct anonymisation experiments in Spanish.
Experiments show that a simple BERT-based model with general-domain pre-training obtains highly competitive results without any domain-specific feature engineering.
arXiv Detail & Related papers (2020-03-06T09:46:51Z)

- What BERT Sees: Cross-Modal Transfer for Visual Question Generation [21.640299110619384]
We study the visual capabilities of BERT out of the box, avoiding any pre-training on supplementary data.
We introduce BERT-gen, a BERT-based architecture for text generation that can leverage either mono- or multi-modal representations.
arXiv Detail & Related papers (2020-02-25T12:44:36Z)
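Several entries above (the main paper, AspirinSum, and the Spanish clinical experiments) share the same final step once privacy-related spans have been detected: replacing them with placeholders. A minimal sketch of that step, assuming BIO-style tags such as those produced by the tagger sketched earlier:

```python
# Minimal anonymisation sketch: collapse each BIO-tagged entity span into a
# single [TYPE] placeholder. Tokens and tags below are illustrative only.
def anonymise(tokens, tags):
    """Replace each tagged entity span with a placeholder for its type."""
    out, i = [], 0
    while i < len(tokens):
        if tags[i].startswith("B-"):
            etype = tags[i][2:]
            i += 1
            while i < len(tokens) and tags[i] == f"I-{etype}":
                i += 1  # consume the rest of the span
            out.append(f"[{etype}]")
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = ["Contact", "Jane", "Doe", "at", "jane@example.com"]
tags = ["O", "B-NAME", "I-NAME", "O", "B-CONTACT"]
print(" ".join(anonymise(tokens, tags)))
# -> "Contact [NAME] at [CONTACT]"
```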
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.