Unsupervised Anomaly Detection for Auditing Data and Impact of
Categorical Encodings
- URL: http://arxiv.org/abs/2210.14056v2
- Date: Wed, 26 Oct 2022 04:03:43 GMT
- Title: Unsupervised Anomaly Detection for Auditing Data and Impact of
Categorical Encodings
- Authors: Ajay Chawda, Stefanie Grimm, Marius Kloft
- Abstract summary: The Vehicle Claims dataset consists of fraudulent insurance claims for automotive repairs.
We tackle the common problem of missing benchmark datasets for anomaly detection.
Shallow and deep learning methods are evaluated on the dataset.
- Score: 20.37092575427039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce the Vehicle Claims dataset, consisting of
fraudulent insurance claims for automotive repairs. The data belongs to the
broader category of Auditing data, which also includes Journals and Network
Intrusion data. Insurance claim data differ from other
auditing data (such as network intrusion data) in their high number of
categorical attributes. We tackle the common problem of missing benchmark
datasets for anomaly detection: datasets are mostly confidential, and the
public tabular datasets do not contain relevant and sufficient categorical
attributes. Therefore, a large dataset is created for this purpose and
referred to as the Vehicle Claims (VC) dataset. Shallow and deep learning
methods are evaluated on the dataset. Because of the categorical attributes,
we encounter the challenge of encoding them for a large dataset. As One Hot
encoding of a high-cardinality dataset invokes the "curse of dimensionality", we
experiment with GEL encoding and an embedding layer for representing categorical
attributes. Our work compares competitive learning, reconstruction-error,
density estimation and contrastive learning approaches using Label, One Hot, and GEL
encoding and an embedding layer to handle categorical values.
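The dimensionality trade-off between the encodings named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the column values, the variable names, and the embedding size `EMBED_DIM` are hypothetical, and the embedding table holds random initial values where a real model would learn them during training. GEL encoding is not shown.

```python
import random

# Hypothetical categorical column (illustrative values, not from the VC dataset).
values = ["audi", "bmw", "ford", "audi", "kia", "ford"]

# Label encoding: one integer per category, so the column stays 1-dimensional.
categories = sorted(set(values))
label_of = {cat: idx for idx, cat in enumerate(categories)}
label_encoded = [label_of[v] for v in values]

# One Hot encoding: one indicator per category, so the row width equals the
# cardinality of the column. This is what grows with high-cardinality data.
one_hot_encoded = [
    [1 if label_of[v] == idx else 0 for idx in range(len(categories))]
    for v in values
]

# Embedding lookup: each category maps to a dense vector of fixed size.
# Assumption: EMBED_DIM is chosen by hand; in a trained model these vectors
# are learnable parameters rather than random numbers.
EMBED_DIM = 2
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in categories
]
embedded = [embedding_table[label_of[v]] for v in values]

print("cardinality:", len(categories))
print("one-hot width:", len(one_hot_encoded[0]))
print("embedding width:", len(embedded[0]))
```

With four distinct categories the One Hot rows have four entries each, while the embedding width stays at `EMBED_DIM` regardless of how many categories the column contains.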
Related papers
- Attribute-Based Semantic Type Detection and Data Quality Assessment [0.5735035463793008]
This research introduces an innovative methodology centered around Attribute-Based Semantic Type Detection and Data Quality Assessment.
By leveraging semantic information within attribute labels, combined with rule-based analysis and comprehensive Formats and Abbreviations dictionaries, our approach introduces a practical semantic type classification system.
A comparative analysis with Sherlock, a state-of-the-art Semantic Type Detection system, shows the advantages of our approach.
(2024-10-04T09:22:44Z)
- DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW).
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
(2024-06-05T01:19:44Z)
- Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness [34.435124846961415]
Meta is working on collecting a large consent-driven dataset with a comprehensive list of categories.
This paper describes our proposed design of such categories and subcategories for Casual Conversations v2.
(2022-11-10T19:06:21Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding.
(2022-06-07T17:59:44Z)
- Learning Semantic Segmentation from Multiple Datasets with Label Shifts [101.24334184653355]
This paper proposes UniSeg, an effective approach to automatically train models across multiple datasets with differing label spaces.
Specifically, we propose two losses that account for conflicting and co-occurring labels to achieve better generalization performance in unseen domains.
(2022-02-28T18:55:19Z)
- CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicting the segmentation maps.
We evaluate the effectiveness of our framework on diverse problems showing that CvS is able to achieve much higher classification results compared to previous methods when given only a handful of examples.
(2021-10-29T18:41:15Z)
- Training Dynamic based data filtering may not work for NLP datasets [0.0]
We study the applicability of the Area Under the Margin (AUM) metric to identify mislabelled examples in NLP datasets.
We find that mislabelled samples can be filtered using the AUM metric in NLP datasets but it also removes a significant number of correctly labeled points.
(2021-09-19T18:50:45Z)
- Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms the state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
(2021-06-11T01:36:08Z)
- Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions [3.4161707164978137]
We use internal and synthetic datasets to evaluate various methods of detecting NPI (Nonpublic Personally Identifiable) information.
Character-level neural network models including CNN, LSTM, BiLSTM-CRF, and CNN-CRF are investigated on two prediction tasks.
(2020-12-17T14:11:03Z)
- Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports [66.39150945184683]
We focus on the problem of automatically identifying the main themes in a safeguarding report using supervised classification approaches.
Our results show the potential of deep learning models to simulate subject-expert behaviour even for complex tasks with limited labelled data.
(2020-10-27T19:48:23Z)
- On Cross-Dataset Generalization in Automatic Detection of Online Abuse [7.163723138100273]
We show that the benign examples in the Wikipedia Detox dataset are biased towards platform-specific topics.
We identify these examples using unsupervised topic modeling and manual inspection of topics' keywords.
For a robust dataset design, we suggest applying inexpensive unsupervised methods to inspect the collected data and downsize the non-generalizable content.
(2020-10-14T21:47:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.