Related papers: Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings

URL: http://arxiv.org/abs/2210.14056v2
Date: Wed, 26 Oct 2022 04:03:43 GMT
Title: Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings
Authors: Ajay Chawda, Stefanie Grimm, Marius Kloft
Abstract summary: The Vehicle Claims dataset consists of fraudulent insurance claims for automotive repairs. We tackle the common problem of missing benchmark datasets for anomaly detection. Shallow and deep learning methods are evaluated on the dataset.
Score: 20.37092575427039
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce the Vehicle Claims dataset, consisting of fraudulent insurance claims for automotive repairs. The data belongs to the broader category of auditing data, which also includes journals and network intrusion data. Insurance claim data differ distinctly from other auditing data (such as network intrusion data) in their high number of categorical attributes. We tackle the common problem of missing benchmark datasets for anomaly detection: datasets are mostly confidential, and the public tabular datasets do not contain relevant and sufficient categorical attributes. Therefore, a large dataset is created for this purpose and referred to as the Vehicle Claims (VC) dataset. Shallow and deep learning methods are evaluated on the dataset. Because the dataset introduces categorical attributes, we encounter the challenge of encoding them at scale. As one-hot encoding of a high-cardinality dataset invokes the "curse of dimensionality", we experiment with GEL encoding and an embedding layer for representing categorical attributes. Our work compares competitive learning, reconstruction-error, density estimation and contrastive learning approaches using Label, One Hot, and GEL encoding and an embedding layer to handle categorical values.
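The dimensionality concern raised in the abstract can be illustrated with a minimal sketch. It contrasts label encoding with one-hot encoding on a toy column; the column values and the helper names `label_encode` and `one_hot_encode` are hypothetical and are not taken from the paper or the Vehicle Claims dataset. GEL encoding and embedding layers are not shown here.

```python
from typing import Dict, List, Tuple


def label_encode(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct category to an integer; the attribute stays one column wide."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values: List[str]) -> List[List[int]]:
    """Expand the attribute into one binary column per distinct category."""
    codes, mapping = label_encode(values)
    width = len(mapping)
    return [[1 if j == c else 0 for j in range(width)] for c in codes]


# Hypothetical toy column (assumption for illustration only).
makers = ["audi", "bmw", "audi", "volvo", "ford", "bmw"]

labels, mapping = label_encode(makers)
one_hot = one_hot_encode(makers)

print(labels)           # [0, 1, 0, 3, 2, 1]
print(len(one_hot[0]))  # 4 columns, one per distinct category
```

With label encoding the attribute occupies a single column regardless of cardinality, whereas the one-hot width grows with the number of distinct categories, which is why high-cardinality attributes inflate the feature space.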

Related papers

A Dataset for Semantic Segmentation in the Presence of Unknowns [49.795683850385956]
Existing datasets allow evaluation of only knowns or unknowns, but not both. We propose a novel anomaly segmentation dataset, ISSU, that features a diverse set of anomaly inputs from cluttered real-world environments. The dataset is twice as large as existing anomaly segmentation datasets.
arXiv Detail & Related papers (2025-03-28T10:31:01Z)
Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes [22.45812577928658]
We introduce a new framework for analyzing classification datasets based on the ratios of reconstruction errors between autoencoders trained on individual classes. This analysis framework enables efficient characterization of datasets on the sample, class, and entire dataset levels.
arXiv Detail & Related papers (2024-12-03T17:29:00Z)
Attribute-Based Semantic Type Detection and Data Quality Assessment [0.5735035463793008]
This research introduces an innovative methodology centered around Attribute-Based Semantic Type Detection and Data Quality Assessment. By leveraging semantic information within attribute labels, combined with rule-based analysis and comprehensive Formats and Abbreviations dictionaries, our approach introduces a practical semantic type classification system. A comparative analysis with Sherlock, a state-of-the-art Semantic Type Detection system, shows the advantages of our approach.
arXiv Detail & Related papers (2024-10-04T09:22:44Z)
DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
arXiv Detail & Related papers (2024-06-05T01:19:44Z)
Casual Conversations v2: Designing a large consent-driven dataset to measure algorithmic bias and robustness [34.435124846961415]
Meta is working on collecting a large consent-driven dataset with a comprehensive list of categories. This paper describes our proposed design of such categories and subcategories for Casual Conversations v2.
arXiv Detail & Related papers (2022-11-10T19:06:21Z)
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned. It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets. The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embeddings.
arXiv Detail & Related papers (2022-06-07T17:59:44Z)
Learning Semantic Segmentation from Multiple Datasets with Label Shifts [101.24334184653355]
This paper proposes UniSeg, an effective approach to automatically train models across multiple datasets with differing label spaces. Specifically, we propose two losses that account for conflicting and co-occurring labels to achieve better generalization performance in unseen domains.
arXiv Detail & Related papers (2022-02-28T18:55:19Z)
CvS: Classification via Segmentation For Small Datasets [52.821178654631254]
This paper presents CvS, a cost-effective classifier for small datasets that derives the classification labels from predicted segmentation maps. We evaluate the effectiveness of our framework on diverse problems, showing that CvS achieves much higher classification results than previous methods when given only a handful of examples.
arXiv Detail & Related papers (2021-10-29T18:41:15Z)
Training Dynamic based data filtering may not work for NLP datasets [0.0]
We study the applicability of the Area Under the Margin (AUM) metric to identify mislabelled examples in NLP datasets. We find that mislabelled samples can be filtered using the AUM metric in NLP datasets, but doing so also removes a significant number of correctly labelled points.
arXiv Detail & Related papers (2021-09-19T18:50:45Z)
Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare. In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples. To tackle this problem, we build a robust one-class classification framework via data refinement. We show that our method outperforms the state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z)
Sensitive Data Detection with High-Throughput Neural Network Models for Financial Institutions [3.4161707164978137]
We use internal and synthetic datasets to evaluate various methods of detecting NPI (Nonpublic Personally Identifiable) information. Character-level neural network models including CNN, LSTM, BiLSTM-CRF, and CNN-CRF are investigated on two prediction tasks.
arXiv Detail & Related papers (2020-12-17T14:11:03Z)
Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports [66.39150945184683]
We focus on the problem of automatically identifying the main themes in a safeguarding report using supervised classification approaches. Our results show the potential of deep learning models to simulate subject-expert behaviour even for complex tasks with limited labelled data.
arXiv Detail & Related papers (2020-10-27T19:48:23Z)
On Cross-Dataset Generalization in Automatic Detection of Online Abuse [7.163723138100273]
We show that the benign examples in the Wikipedia Detox dataset are biased towards platform-specific topics. We identify these examples using unsupervised topic modeling and manual inspection of topics' keywords. For a robust dataset design, we suggest applying inexpensive unsupervised methods to inspect the collected data and downsize the non-generalizable content.
arXiv Detail & Related papers (2020-10-14T21:47:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.