Selecting Datasets for Evaluating an Enhanced Deep Learning Framework
- URL: http://arxiv.org/abs/2109.10442v1
- Date: Tue, 21 Sep 2021 22:09:30 GMT
- Title: Selecting Datasets for Evaluating an Enhanced Deep Learning Framework
- Authors: Kudakwashe Dandajena, Isabella M. Venter, Mehrdad Ghaziasgar and Reg Dodds
- Abstract summary: This work deals with the steps followed to select suitable datasets characterised by discrete irregular sequential patterns.
The developed framework was then tested using the most appropriate datasets.
The research concluded that the daily currency-exchange domain of financial markets is the most suitable kind of dataset for evaluating the designed deep learning framework.
- Score: 0.2999888908665658
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: A framework was developed to address limitations associated with existing
techniques for analysing sequences. This work deals with the steps followed to
select suitable datasets characterised by discrete irregular sequential
patterns. To identify, select, explore, and evaluate candidate datasets from
various sources, extracted from more than 400 research articles, an
interquartile-range method for outlier calculation and a qualitative
adaptation of Billauer's algorithm were applied to provide periodic peak
detection in these datasets.
The developed framework was then tested using the most appropriate datasets.
The research concluded that the daily currency-exchange domain of financial
markets is the most suitable kind of dataset for evaluating the designed deep
learning framework, as it provides high levels of discrete irregular patterns.
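The two ingredients named in the abstract, interquartile-range outlier fences and Billauer-style peak detection, can be sketched as follows (function names and parameters are illustrative, not taken from the paper):

```python
import numpy as np

def iqr_outlier_bounds(values, k=1.5):
    """Tukey's interquartile-range fences: points outside
    [Q1 - k*IQR, Q3 + k*IQR] are flagged as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def peakdet(series, delta):
    """Billauer-style peak detection: a local maximum is confirmed once
    the series drops by at least `delta` after it (and symmetrically for
    minima). Returns lists of (index, value) pairs."""
    maxima, minima = [], []
    mn, mx = float("inf"), float("-inf")
    mn_pos = mx_pos = None
    looking_for_max = True
    for i, v in enumerate(series):
        if v > mx:
            mx, mx_pos = v, i
        if v < mn:
            mn, mn_pos = v, i
        if looking_for_max and v < mx - delta:
            maxima.append((mx_pos, mx))   # peak confirmed by the drop
            mn, mn_pos = v, i
            looking_for_max = False
        elif not looking_for_max and v > mn + delta:
            minima.append((mn_pos, mn))   # trough confirmed by the rise
            mx, mx_pos = v, i
            looking_for_max = True
    return maxima, minima

lo, hi = iqr_outlier_bounds([1, 2, 3, 4, 100])   # 100 falls outside (lo, hi)
peaks, troughs = peakdet([0, 1, 3, 1, 0, 2, 5, 2, 0], delta=1)
```

On a daily exchange-rate series, the IQR fences filter extreme observations while `peakdet` surfaces the irregular periodic peaks that the selection criteria look for.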
Related papers
- Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine [0.35998666903987897]
We investigate the impact of dataset selection bias, particularly the practice of cherry-picking datasets, on the performance evaluation of forecasting methods.
By selectively choosing just four datasets, 46% of methods could be deemed best in class, and 77% could rank within the top three.
Our results indicate that, when empirically validating forecasting algorithms on a subset of the benchmarks, increasing the number of datasets tested from 3 to 6 reduces the risk of incorrectly identifying an algorithm as the best one by approximately 40%.
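The cherry-picking effect the authors quantify can be reproduced with a toy example: for each method, adversarially keep only the datasets where it trails the field the least, then rank methods on that subset. The scores below are made up for illustration, and the subset-selection rule is a simple stand-in for the paper's analysis:

```python
import numpy as np

# Hypothetical scores (higher is better) for 3 methods on 4 datasets;
# no method dominates, yet each can be made to look best.
scores = np.array([
    [9, 1, 1, 1],   # strong on dataset 0 only
    [2, 8, 2, 2],   # strong on dataset 1 only
    [3, 3, 3, 3],   # mediocre everywhere
], dtype=float)
k = 2  # size of the cherry-picked benchmark

best_in_class = 0
for m in range(len(scores)):
    # Keep the k datasets most favourable to method m.
    margin = scores[m] - scores.max(axis=0)   # 0 where m is already best
    picked = np.argsort(margin)[-k:]
    # Is m the top method on average over the picked subset?
    if scores[:, picked].mean(axis=1).argmax() == m:
        best_in_class += 1

print(best_in_class, "of", len(scores), "methods can be presented as best")
```

Here every method ends up "best in class" on its own curated benchmark, which is the failure mode the paper's larger-scale analysis measures.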
arXiv Detail & Related papers (2024-12-19T01:34:17Z)
- Unsupervised feature selection algorithm framework based on neighborhood interval disturbance fusion [8.869067846581943]
The universality and stability of many unsupervised feature selection algorithms are very low and greatly affected by the dataset structure.
This paper preprocesses the dataset and uses an interval method to approximate it, experimentally verifying the advantages and disadvantages of the resulting interval dataset.
arXiv Detail & Related papers (2024-10-20T05:39:04Z)
- SKADA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities [55.87169702896249]
Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift.
We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment.
Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications.
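Reweighting, one family of shallow DA methods evaluated in such benchmarks, assigns each source sample an importance weight approximating the density ratio p_target(x)/p_source(x). A minimal 1-D histogram-based sketch (the function name and binning scheme are illustrative, not from SKADA-Bench):

```python
import numpy as np

def importance_weights(src, tgt, bins=10):
    """Histogram-based density-ratio weights w(x) ~ p_tgt(x) / p_src(x)
    for 1-D features: a crude stand-in for the reweighting methods
    benchmarked for unsupervised domain adaptation."""
    lo, hi = min(src.min(), tgt.min()), max(src.max(), tgt.max())
    edges = np.linspace(lo, hi, bins + 1)
    p_src, _ = np.histogram(src, edges, density=True)
    p_tgt, _ = np.histogram(tgt, edges, density=True)
    # Map each source point to its bin and read off the density ratio.
    idx = np.clip(np.digitize(src, edges) - 1, 0, bins - 1)
    return p_tgt[idx] / np.maximum(p_src[idx], 1e-12)

src = np.array([-1.0, 0.0, 1.0])   # source shifted left of target
tgt = np.array([ 0.0, 1.0, 2.0])
w = importance_weights(src, tgt, bins=3)   # upweights rightmost source points
```

Source samples that look like the target get large weights; a classifier trained on the weighted source loss then better matches the target distribution.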
arXiv Detail & Related papers (2024-07-16T12:52:29Z)
- DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Our DACO-RL algorithm is judged by human annotators to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- Exploring Data Redundancy in Real-world Image Classification through Data Selection [20.389636181891515]
Deep learning models often require large amounts of data for training, leading to increased costs.
We present two data valuation metrics based on Synaptic Intelligence and gradient norms, respectively, to study redundancy in real-world image data.
Online and offline data selection algorithms are then proposed via clustering and grouping based on the examined data values.
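A gradient-norm data-valuation metric of the kind described can be sketched for a simple logistic model: examples whose per-example loss gradient has small norm contribute little to training and are candidates for pruning as redundant. This is an illustrative stand-in, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_norm_scores(X, y, w):
    """Norm of each example's logistic-loss gradient w.r.t. weights w.
    Well-fit (redundant) examples score low; hard examples score high."""
    p = sigmoid(X @ w)                  # predicted probabilities
    grads = (p - y)[:, None] * X        # per-example loss gradients
    return np.linalg.norm(grads, axis=1)

# Toy data: under w, the first example is confidently correct,
# the second confidently wrong.
X = np.array([[ 3.0, 0.0],
              [-3.0, 0.0]])
y = np.array([1.0, 1.0])
w = np.array([2.0, 0.0])
example_scores = grad_norm_scores(X, y, w)
```

Ranking examples by such scores (or clustering them, as the paper's selection algorithms do) lets one drop low-value examples before training.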
arXiv Detail & Related papers (2023-06-25T03:31:05Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
In particular, our sequential data is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- A review of systematic selection of clustering algorithms and their evaluation [0.0]
This paper aims to identify a systematic selection logic for clustering algorithms and corresponding validation concepts.
The goal is to enable potential users to choose an algorithm that fits best to their needs and the properties of their underlying data clustering problem.
arXiv Detail & Related papers (2021-06-24T07:01:46Z)
- Finding the Homology of Decision Boundaries with Active Learning [26.31885403636642]
We propose an active learning algorithm to recover the homology of decision boundaries.
Our algorithm sequentially and adaptively selects which samples it requires the labels of.
Experiments on several datasets show the sample complexity improvement in recovering the homology.
arXiv Detail & Related papers (2020-11-19T04:22:06Z)
- Topology-based Clusterwise Regression for User Segmentation and Demand Forecasting [63.78344280962136]
Using a public and a novel proprietary data set of commercial data, this research shows that the proposed system enables analysts to both cluster their user base and plan demand at a granular level.
This work seeks to introduce TDA-based clustering of time series and clusterwise regression with matrix factorization methods as viable tools for the practitioner.
arXiv Detail & Related papers (2020-09-08T12:10:10Z)
- CONSAC: Robust Multi-Model Fitting by Conditional Sample Consensus [62.86856923633923]
We present a robust estimator for fitting multiple parametric models of the same form to noisy measurements.
In contrast to previous works, which resorted to hand-crafted search strategies for multiple model detection, we learn the search strategy from data.
The search strategy is learned in a self-supervised manner; we evaluate the proposed algorithm on multi-homography estimation and demonstrate accuracy superior to state-of-the-art methods.
arXiv Detail & Related papers (2020-01-08T17:37:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.