Characterizing Transactional Databases for Frequent Itemset Mining
- URL: http://arxiv.org/abs/2011.04378v1
- Date: Mon, 9 Nov 2020 12:26:14 GMT
- Title: Characterizing Transactional Databases for Frequent Itemset Mining
- Authors: Christian Lezcano, Marta Arias
- Abstract summary: This paper presents a study of the characteristics of transactional databases used in frequent itemset mining.
Our proposed list of metrics contains many of the existing metrics found in the literature, as well as new ones.
We provide a set of representative datasets based on our characterization that may be used as a benchmark safely.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a study of the characteristics of transactional databases
used in frequent itemset mining. Such characterizations have typically been
used to benchmark and understand the data mining algorithms working on these
databases. The aim of our study is to give a picture of how diverse and
representative these benchmarking databases are, both in general but also in
the context of particular empirical studies found in the literature. Our
proposed list of metrics contains many of the existing metrics found in the
literature, as well as new ones. Our study shows that our list of metrics is
able to capture much of the datasets' inner complexity and thus provides a good
basis for the characterization of transactional datasets. Finally, we provide a
set of representative datasets based on our characterization that may be used
as a benchmark safely.
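As an illustration of the kind of characterization the paper discusses, the sketch below computes a few common structural metrics of a transactional database (transaction count, distinct items, average transaction length, density). The metric names and the `characterize` function are illustrative assumptions, not the paper's actual metric list.

```python
# Illustrative sketch (not the paper's exact metric list): basic
# characterization metrics for a transactional database, where each
# transaction is a set of item identifiers.

def characterize(transactions):
    """Return basic structural metrics of a transactional dataset."""
    n_transactions = len(transactions)
    items = set().union(*transactions)
    n_items = len(items)
    lengths = [len(t) for t in transactions]
    avg_len = sum(lengths) / n_transactions
    # Density: fraction of the transaction-item matrix that is non-empty.
    density = sum(lengths) / (n_transactions * n_items)
    return {
        "transactions": n_transactions,
        "distinct_items": n_items,
        "avg_transaction_length": avg_len,
        "density": density,
    }

# Toy database of four transactions over items {a, b, c, d}.
db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
stats = characterize(db)
```

Metrics like these let one place benchmark datasets in a common feature space and ask how diverse (or redundant) a chosen benchmark suite really is.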
Related papers
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - TabReD: A Benchmark of Tabular Machine Learning in-the-Wild [30.922069185335246]
We show that industry-grade datasets are underrepresented in academic benchmarks for machine learning.
We introduce TabReD, a collection of eight industry-grade datasets covering a wide range of domains.
We show that evaluation on time-based data splits leads to a different ranking of methods, compared to evaluation on the random splits more common in academic benchmarks.
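The contrast between the two evaluation protocols can be sketched as follows; the toy records and field names are hypothetical, not TabReD's actual data format.

```python
# Hypothetical sketch: a random split vs. a time-based split of the
# same timestamped records.
import random

# Toy dataset: one record per day, with a timestamp field "t".
records = [{"t": day, "x": day % 7} for day in range(100)]

# Random split: train and test are drawn from the same time range.
shuffled = random.Random(0).sample(records, len(records))
rand_train, rand_test = shuffled[:80], shuffled[80:]

# Time-based split: the test set strictly follows the training set in
# time, mimicking deployment on future data.
ordered = sorted(records, key=lambda r: r["t"])
time_train, time_test = ordered[:80], ordered[80:]

# Under the time-based split every test record is newer than every
# training record; the random split offers no such guarantee.
assert min(r["t"] for r in time_test) > max(r["t"] for r in time_train)
```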
arXiv Detail & Related papers (2024-06-27T17:55:31Z) - On the performativity of SDG classifications in large bibliometric databases [0.0]
Large bibliometric databases have taken up the UN's Sustainable Development Goals in their respective classifications.
This work proposes using large language models (LLMs) to learn about the "data bias" injected into bibliometric data by diverse SDG classifications.
arXiv Detail & Related papers (2024-05-05T17:28:54Z) - A Comprehensive Survey on Vector Database: Storage and Retrieval Technique, Challenge [4.579314354865921]
The approximate nearest neighbor search problem behind vector databases has been studied for a long time.
This article attempts to comprehensively review relevant algorithms to provide a general understanding of this booming research area.
arXiv Detail & Related papers (2023-10-18T04:31:06Z) - infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - Text Characterization Toolkit [33.6713815884553]
We argue that deeper results analysis should become the de-facto standard when presenting new models or benchmarks.
We present a tool that researchers can use to study properties of the dataset and the influence of those properties on their models' behaviour.
arXiv Detail & Related papers (2022-10-04T16:54:11Z) - An Assessment Tool for Academic Research Managers in the Third World [125.99533416395765]
We show how the data in one of the databases can be used to infer the main index of the other.
Since the information in SCOPUS can be freely scraped from the Web, this approach allows one to infer the Impact Factor of publications at no cost.
arXiv Detail & Related papers (2022-09-07T14:59:25Z) - Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification [39.01740345482624]
In this paper, we ask the research question of whether all the datasets in the benchmark are necessary.
Experiments on 9 datasets and 36 systems show that several existing benchmark datasets contribute little to discriminating among top-scoring systems, while less frequently used datasets exhibit impressive discriminative power.
arXiv Detail & Related papers (2022-05-04T15:33:00Z) - Simple multi-dataset detection [83.9604523643406]
We present a simple method for training a unified detector on multiple large-scale datasets.
We show how to automatically integrate dataset-specific outputs into a common semantic taxonomy.
Our approach does not require manual taxonomy reconciliation.
arXiv Detail & Related papers (2021-02-25T18:55:58Z) - BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
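The subpopulation-split idea described in the BREEDS entry above can be sketched as follows; the class hierarchy and the `subpopulation_split` helper are hypothetical illustrations, not the benchmark's actual construction.

```python
# Illustrative sketch: build a subpopulation-shift split by assigning
# disjoint subclasses of each superclass to train vs. test, so the
# model must generalize to unseen subpopulations of a known class.
hierarchy = {
    "dog": ["beagle", "poodle", "husky", "corgi"],
    "cat": ["siamese", "persian", "tabby", "sphynx"],
}

def subpopulation_split(hierarchy):
    """Split each superclass's subclasses into disjoint train/test halves."""
    train, test = {}, {}
    for superclass, subclasses in hierarchy.items():
        half = len(subclasses) // 2
        train[superclass] = subclasses[:half]  # subpopulations seen in training
        test[superclass] = subclasses[half:]   # unseen subpopulations at test time
    return train, test

train_split, test_split = subpopulation_split(hierarchy)
```

Varying where the hierarchy is cut (coarser or finer superclasses) yields benchmarks of varying granularity, as the entry describes.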
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.