Customs Import Declaration Datasets
- URL: http://arxiv.org/abs/2208.02484v3
- Date: Mon, 4 Sep 2023 05:48:50 GMT
- Title: Customs Import Declaration Datasets
- Authors: Chaeyoon Jeong and Sundong Kim and Jaewoo Park and Yeonsoo Choi
- Abstract summary: We introduce an import declaration dataset to facilitate the collaboration between domain experts in customs administrations and researchers from diverse domains.
The dataset contains 54,000 artificially generated trades with 22 key attributes.
We empirically show that more advanced algorithms can better detect fraud.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Given the huge volume of cross-border flows, effective and efficient
control of trade becomes ever more crucial for protecting people and society
from illicit trade. However, the limited accessibility of transaction-level
trade datasets hinders the progress of open research, and many customs
administrations have not benefited from recent advances in data-based risk
management. In this paper, we introduce an import declaration dataset to
facilitate collaboration between domain experts in customs administrations and
researchers from diverse fields, such as data science and machine learning. The
dataset contains 54,000 artificially generated trades with 22 key attributes,
synthesized with a conditional tabular GAN (CTGAN) while preserving correlated
features. Synthetic data has several advantages. First, releasing the dataset
is free from the restrictions that prohibit disclosing the original import
data, and the synthesis step minimizes the identity-disclosure risk that may
exist in trade statistics. Second, the published data follow a distribution
similar to that of the source data, so they can be used in various downstream
tasks. Hence, our dataset can serve as a benchmark for testing the performance
of any classification algorithm. Along with the data and its generation
process, we release baseline code for fraud detection tasks, and we empirically
show that more advanced algorithms detect fraud better.
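The released baselines are not reproduced here, but the paper's empirical claim — that stronger classifiers detect fraud better on tabular trade data — can be illustrated with a hedged sketch. The example below uses scikit-learn and a randomly generated stand-in for the synthetic declarations; the real dataset, its column names, and the authors' baseline code are not assumed here.

```python
# Sketch: compare a simple and a more advanced classifier on a
# stand-in for tabular import declarations. The generated features
# and the ~5% "fraud" rate are illustrative, not the real dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 22 features, mirroring the dataset's 22 key attributes.
X, y = make_classification(
    n_samples=5000, n_features=22, n_informative=10,
    weights=[0.95, 0.05], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0,
)

simple = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
advanced = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

auc_simple = roc_auc_score(y_te, simple.predict_proba(X_te)[:, 1])
auc_advanced = roc_auc_score(y_te, advanced.predict_proba(X_te)[:, 1])
print(f"logistic AUC={auc_simple:.3f}  boosting AUC={auc_advanced:.3f}")
```

On the heavily imbalanced fraud setting, ranking metrics such as ROC-AUC (or precision at top-k inspections) are more informative than plain accuracy, which a trivial "never fraud" rule already maximizes.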
Related papers
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
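The paper's valuation policy is not reproduced here, but its core quantity — a kernel maximum mean discrepancy between two samples — can be sketched in a few lines of NumPy. The RBF bandwidth, sample sizes, and distributions below are illustrative assumptions.

```python
import numpy as np

def mmd2_rbf(x, y, bandwidth=1.0):
    """Biased estimate of squared MMD between samples x and y under
    an RBF kernel k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2))."""
    def kernel(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 3))
y_close = rng.normal(0.0, 1.0, size=(200, 3))  # same distribution
y_far = rng.normal(2.0, 1.0, size=(200, 3))    # shifted distribution

# A sample from the same distribution scores a much smaller
# discrepancy than a shifted one.
print(mmd2_rbf(x, y_close), mmd2_rbf(x, y_far))
```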
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
- MaSS: Multi-attribute Selective Suppression for Utility-preserving Data Transformation from an Information-theoretic Perspective [10.009178591853058]
We propose a formal information-theoretic definition for this utility-preserving privacy protection problem.
We design a data-driven learnable data transformation framework that is capable of suppressing sensitive attributes from target datasets.
Results demonstrate the effectiveness and generalizability of our method under various configurations.
arXiv Detail & Related papers (2024-05-23T18:35:46Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
We convene legal and machine learning experts to systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We find frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- Harnessing Administrative Data Inventories to Create a Reliable Transnational Reference Database for Crop Type Monitoring [0.0]
We showcase EuroCrops, a reference dataset for crop type classification that aggregates and harmonizes administrative data surveyed in different countries with the goal of transnational interoperability.
arXiv Detail & Related papers (2023-10-10T07:57:00Z)
- Packaging code for reproducible research in the public sector [0.0]
The jtstats project consists of R and Python packages for importing, processing, and visualising large and complex datasets.
Jtstats shows how domain specific packages can enable reproducible research within the public sector and beyond.
arXiv Detail & Related papers (2023-05-25T16:07:24Z)
- Towards Generalizable Data Protection With Transferable Unlearnable Examples [50.628011208660645]
We present a novel, generalizable data protection method by generating transferable unlearnable examples.
To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution.
arXiv Detail & Related papers (2023-05-18T04:17:01Z)
- A Federated Learning Benchmark for Drug-Target Interaction [17.244787426504626]
This work proposes the application of federated learning in the drug-target interaction (DTI) domain.
It achieves up to 15% improved performance relative to the best available non-privacy preserving alternative.
Our extensive battery of experiments shows that, unlike in other domains, the non-IID data distribution in the DTI datasets does not deteriorate FL performance.
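The DTI benchmark itself is not reproduced here, but the federated averaging step at the heart of such FL setups can be sketched in NumPy. The client counts, dataset sizes, and weight shapes below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: average client model parameters,
    weighting each client by its local dataset size."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    return sum(c * w for c, w in zip(coeffs, client_weights))

# Three hypothetical clients holding different numbers of records.
clients = [np.full((2, 2), 1.0), np.full((2, 2), 2.0), np.full((2, 2), 4.0)]
sizes = [100, 100, 200]

global_weights = fedavg(clients, sizes)
print(global_weights)  # each entry: 0.25*1 + 0.25*2 + 0.5*4 = 2.75
```

In a non-IID setting, these per-client weightings are exactly where distribution skew enters the aggregation, which is why the paper's finding that skew does not hurt DTI performance is notable.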
arXiv Detail & Related papers (2023-02-15T14:21:31Z)
- A Comprehensive Survey of Dataset Distillation [73.15482472726555]
Deep learning technology has developed unprecedentedly in the last decade.
It has become challenging to handle the unlimited growth of data with limited computing power.
This paper provides a holistic understanding of dataset distillation from multiple aspects.
arXiv Detail & Related papers (2023-01-13T15:11:38Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Bayesian Semi-supervised Crowdsourcing [71.20185379303479]
Crowdsourcing has emerged as a powerful paradigm for efficiently labeling large datasets and performing various learning tasks.
This work deals with semi-supervised crowdsourced classification, under two regimes of semi-supervision.
arXiv Detail & Related papers (2020-12-20T23:18:51Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all of its content) and is not responsible for any consequences of its use.