Sensitive Data Detection with High-Throughput Neural Network Models for
Financial Institutions
- URL: http://arxiv.org/abs/2012.09597v1
- Date: Thu, 17 Dec 2020 14:11:03 GMT
- Authors: Anh Truong, Austin Walters, Jeremy Goodsitt
- Abstract summary: We use internal and synthetic datasets to evaluate various methods of detecting NPI (Nonpublic Personally Identifiable) information.
Character-level neural network models including CNN, LSTM, BiLSTM-CRF, and CNN-CRF are investigated on two prediction tasks.
- Score: 3.4161707164978137
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Named Entity Recognition has been extensively investigated in many fields.
However, the application of sensitive entity detection for production systems
in financial institutions has not been well explored due to the lack of
publicly available, labeled datasets. In this paper, we use internal and
synthetic datasets to evaluate various methods of detecting NPI (Nonpublic
Personally Identifiable) information commonly found within financial
institutions, in both unstructured and structured data formats. Character-level
neural network models including CNN, LSTM, BiLSTM-CRF, and CNN-CRF are
investigated on two prediction tasks: (i) entity detection on multiple data
formats, and (ii) column-wise entity prediction on tabular datasets. We compare
these models with other standard approaches on both real and synthetic data,
with respect to F1-score, precision, recall, and throughput. The real datasets
include internal structured data and public email data with manually tagged
labels. Our experimental results show that the CNN model is simple yet
effective with respect to accuracy and throughput and is thus the most
suitable candidate model to be deployed in the production environment(s).
Finally, we provide several lessons learned on data limitations, data labelling
and the intrinsic overlap of data entities.
Related papers
- Approaching Metaheuristic Deep Learning Combos for Automated Data Mining [0.5419570023862531]
This work proposes a means of combining meta-heuristic methods with conventional classifiers and neural networks in order to perform automated data mining.
Experiments on the MNIST dataset for handwritten digit recognition were performed.
It was empirically observed that using a ground truth labeled dataset's validation accuracy is inadequate for correcting labels of other previously unseen data instances.
arXiv Detail & Related papers (2024-10-16T10:28:22Z)
- Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
- Unsupervised Domain Adaption for Neural Information Retrieval [18.97486314518283]
We compare synthetic annotation by query generation using Large Language Models or rule-based string manipulation.
We find that Large Language Models outperform rule-based methods in all scenarios by a large margin.
In addition, we explore several sizes of open Large Language Models to generate synthetic data and find that a medium-sized model suffices.
arXiv Detail & Related papers (2023-10-13T18:27:33Z)
- FairGen: Fair Synthetic Data Generation [0.3149883354098941]
We propose a pipeline to generate fairer synthetic data independent of the GAN architecture.
We claim that, while generating synthetic data, most GANs amplify bias present in the training data, but by removing these bias-inducing samples, GANs essentially focus more on real informative samples.
arXiv Detail & Related papers (2022-10-24T08:13:47Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning [1.9852463786440129]
We describe a novel approach to enhance supervised training on synthetic data with real data features.
In the training stage, the input data are from the synthetic domain and the auto-correlated data are from the real domain.
In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections is from the synthetic data subset domain.
arXiv Detail & Related papers (2021-09-11T14:43:34Z)
- Rank-R FNN: A Tensor-Based Learning Model for High-Order Data Classification [69.26747803963907]
Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes Canonical/Polyadic decomposition on its parameters.
First, it handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
arXiv Detail & Related papers (2021-04-11T16:37:32Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)