An Empirical Evaluation of the t-SNE Algorithm for Data Visualization in
Structural Engineering
- URL: http://arxiv.org/abs/2109.08795v1
- Date: Sat, 18 Sep 2021 01:24:39 GMT
- Title: An Empirical Evaluation of the t-SNE Algorithm for Data Visualization in
Structural Engineering
- Authors: Parisa Hajibabaee, Farhad Pourkamali-Anaraki, Mohammad Amin
Hariri-Ardebili
- Abstract summary: The t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm is used to reduce the dimensions of an earthquake-related data set for visualization purposes.
The Synthetic Minority Oversampling Technique (SMOTE) is used to tackle the imbalanced nature of such a data set.
We show that, by applying t-SNE to the imbalanced data and SMOTE to the training data set, neural network classifiers achieve promising results without sacrificing accuracy.
- Score: 2.4493299476776773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A fundamental task in machine learning involves visualizing high-dimensional
data sets that arise in high-impact application domains. When considering the
context of large imbalanced data, this problem becomes much more challenging.
In this paper, the t-Distributed Stochastic Neighbor Embedding (t-SNE)
algorithm is used to reduce the dimensions of an earthquake engineering related
data set for visualization purposes. Since imbalanced data sets greatly affect
the accuracy of classifiers, we employ the Synthetic Minority Oversampling
Technique (SMOTE) to tackle the imbalanced nature of such a data set. We
present the results obtained with t-SNE and SMOTE and compare them to the
baseline approaches across several aspects. Considering four options and six
classification algorithms, we show that, by applying t-SNE to the imbalanced
data and SMOTE to the training data set, neural network classifiers achieve
promising results without sacrificing accuracy. Hence, we can transform the
studied scientific data into a two-dimensional (2D) space, enabling
visualization of the classifier and the resulting decision surface in a 2D
plot.
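As a rough illustration of the pipeline sketched in the abstract, the snippet
below is a minimal sketch (not the authors' implementation) that combines
t-SNE, SMOTE, and a neural network classifier using scikit-learn and
imbalanced-learn. The synthetic data set, class ratios, and hyperparameters
(e.g., perplexity, hidden layer sizes) are placeholder assumptions chosen only
for the example.

```python
# Minimal sketch of the t-SNE + SMOTE + neural-network pipeline described above.
# NOT the authors' implementation: data set and hyperparameters are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import balanced_accuracy_score
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# 1) Synthetic stand-in for a high-dimensional, imbalanced engineering data set.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           n_classes=3, weights=[0.80, 0.15, 0.05],
                           random_state=0)

# 2) t-SNE embeds the (still imbalanced) data into two dimensions.
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# 3) Split the 2D embedding, then oversample the training portion only with
#    SMOTE so the test set keeps its original class distribution.
X_train, X_test, y_train, y_test = train_test_split(
    X_2d, y, test_size=0.3, stratify=y, random_state=0)
X_train_bal, y_train_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)

# 4) Train a neural-network classifier on the balanced 2D training data.
clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
clf.fit(X_train_bal, y_train_bal)
print("balanced accuracy:", balanced_accuracy_score(y_test, clf.predict(X_test)))

# 5) Because the classifier operates in 2D, its decision surface can be drawn
#    by evaluating it on a grid over the embedding (e.g., matplotlib contourf).
xx, yy = np.meshgrid(np.linspace(X_2d[:, 0].min(), X_2d[:, 0].max(), 200),
                     np.linspace(X_2d[:, 1].min(), X_2d[:, 1].max(), 200))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```

Note that t-SNE has no out-of-sample transform, so the embedding is computed on
the full data set before the train/test split; in practice the perplexity and
network size would need tuning for the actual earthquake engineering data.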
Related papers
- Two-Stage Hierarchical and Explainable Feature Selection Framework for Dimensionality Reduction in Sleep Staging [0.6216545676226375]
EEG signals play a significant role in sleep research.
Due to the high-dimensional nature of EEG signal data sequences, data visualization and clustering of different sleep stages have been challenging.
We propose a two-stage hierarchical and explainable feature selection framework by incorporating a feature selection algorithm to improve the performance of dimensionality reduction.
arXiv Detail & Related papers (2024-08-31T23:54:53Z)
- Few-shot learning for COVID-19 Chest X-Ray Classification with Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained on generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- On the Convergence of Loss and Uncertainty-based Active Learning Algorithms [3.506897386829711]
We investigate the convergence rates and data sample sizes required for training a machine learning model using a stochastic gradient descent (SGD) algorithm.
We present convergence results for linear classifiers and linearly separable datasets using squared hinge loss and similar training loss functions.
arXiv Detail & Related papers (2023-12-21T15:22:07Z)
- Defect Classification in Additive Manufacturing Using CNN-Based Vision Processing [76.72662577101988]
This paper examines two scenarios: first, using convolutional neural networks (CNNs) to accurately classify defects in an image dataset from AM and second, applying active learning techniques to the developed classification model.
This allows the construction of a human-in-the-loop mechanism that reduces the amount of data required to train the model and to generate training data.
arXiv Detail & Related papers (2023-07-14T14:36:58Z)
- Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Despite these encouraging results, deep learning for missing data imputation still faces obstacles, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z)
- Minimizing the Accumulated Trajectory Error to Improve Dataset Distillation [151.70234052015948]
We propose a novel approach that encourages the optimization algorithm to seek a flat trajectory.
We show that weights trained on synthetic data are robust against accumulated error perturbations when regularized towards a flat trajectory.
Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7%.
arXiv Detail & Related papers (2022-11-20T15:49:11Z)
- Data Scaling Laws in NMT: The Effect of Noise and Architecture [59.767899982937756]
We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
arXiv Detail & Related papers (2022-02-04T06:53:49Z)
- Rank-R FNN: A Tensor-Based Learning Model for High-Order Data Classification [69.26747803963907]
Rank-R Feedforward Neural Network (FNN) is a tensor-based nonlinear learning model that imposes Canonical/Polyadic decomposition on its parameters.
First, it handles inputs as multilinear arrays, bypassing the need for vectorization, and can thus fully exploit the structural information along every data dimension.
We establish the universal approximation and learnability properties of Rank-R FNN, and we validate its performance on real-world hyperspectral datasets.
arXiv Detail & Related papers (2021-04-11T16:37:32Z)
- Prediction of Object Geometry from Acoustic Scattering Using Convolutional Neural Networks [8.067201256886733]
The present work proposes to infer object geometry from scattering features by training convolutional neural networks.
The robustness of our approach in response to data degradation is evaluated by comparing the performance of networks trained using the datasets.
arXiv Detail & Related papers (2020-10-21T00:51:14Z)
- Two-Dimensional Semi-Nonnegative Matrix Factorization for Clustering [50.43424130281065]
We propose a new Semi-Nonnegative Matrix Factorization method for 2-dimensional (2D) data, named TS-NMF.
It overcomes the drawback of existing methods that seriously damage the spatial information of the data by converting 2D data to vectors in a preprocessing step.
arXiv Detail & Related papers (2020-05-19T05:54:14Z)
- Performance Analysis of Semi-supervised Learning in the Small-data Regime using VAEs [0.261072980439312]
In this work, we applied an existing algorithm that pre-trains a latent space representation of the data to capture its features in a lower dimension for the small-data regime input.
The fine-tuned latent space provides constant weights that are useful for classification.
Here we present the performance analysis of the VAE algorithm with different latent space sizes in the semi-supervised learning setting.
arXiv Detail & Related papers (2020-02-26T16:19:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.