Modern Neural Networks for Small Tabular Datasets: The New Default for Field-Scale Digital Soil Mapping?
- URL: http://arxiv.org/abs/2508.09888v1
- Date: Wed, 13 Aug 2025 15:46:12 GMT
- Title: Modern Neural Networks for Small Tabular Datasets: The New Default for Field-Scale Digital Soil Mapping?
- Authors: Viacheslav Barkov, Jonas Schmidinger, Robin Gebbers, Martin Atzmueller
- Abstract summary: We introduce a benchmark that evaluates state-of-the-art artificial neural networks (ANNs) for predictive soil modeling. Our evaluation encompasses 31 field- and farm-scale datasets containing 30 to 460 samples and three critical soil properties. We recommend the adoption of modern ANNs for field-scale PSM and propose TabPFN as the new default choice in the toolkit of every pedometrician.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of pedometrics, tabular machine learning is the predominant method for predicting soil properties from remote and proximal soil sensing data, forming a central component of digital soil mapping. At the field-scale, this predictive soil modeling (PSM) task is typically constrained by small training sample sizes and high feature-to-sample ratios in soil spectroscopy. Traditionally, these conditions have proven challenging for conventional deep learning methods. Classical machine learning algorithms, particularly tree-based models like Random Forest and linear models such as Partial Least Squares Regression, have long been the default choice for field-scale PSM. Recent advances in artificial neural networks (ANN) for tabular data challenge this view, yet their suitability for field-scale PSM has not been proven. We introduce a comprehensive benchmark that evaluates state-of-the-art ANN architectures, including the latest multilayer perceptron (MLP)-based models (TabM, RealMLP), attention-based transformer variants (FT-Transformer, ExcelFormer, T2G-Former, AMFormer), retrieval-augmented approaches (TabR, ModernNCA), and an in-context learning foundation model (TabPFN). Our evaluation encompasses 31 field- and farm-scale datasets containing 30 to 460 samples and three critical soil properties: soil organic matter or soil organic carbon, pH, and clay content. Our results reveal that modern ANNs consistently outperform classical methods on the majority of tasks, demonstrating that deep learning has matured sufficiently to overcome the long-standing dominance of classical machine learning for PSM. Notably, TabPFN delivers the strongest overall performance, showing robustness across varying conditions. We therefore recommend the adoption of modern ANNs for field-scale PSM and propose TabPFN as the new default choice in the toolkit of every pedometrician.
Related papers
- Horizon Activation Mapping for Neural Networks in Time Series Forecasting [0.0]
Horizon Activation Mapping (HAM) is a visual interpretability method inspired by Grad-CAM. HAM can be used for granular model selection, validation-set choices, and comparisons across different neural network model families.
arXiv Detail & Related papers (2026-01-05T13:21:30Z) - Predicting California Bearing Ratio with Ensemble and Neural Network Models: A Case Study from Turkiye [0.0]
The California Bearing Ratio (CBR) is a key geotechnical indicator used to assess the load-bearing capacity of subgrade soils. Traditional tests are often time-consuming, costly, and can be impractical, particularly for large-scale or diverse soil profiles. Recent progress in artificial intelligence, especially machine learning (ML), has enabled data-driven approaches for modeling complex soil behavior with greater speed and precision. This study introduces a comprehensive ML framework for CBR prediction using a dataset of 382 soil samples collected from various geoclimatic regions in Türkiye.
arXiv Detail & Related papers (2025-12-09T08:09:55Z) - iLTM: Integrated Large Tabular Model [41.81329403540607]
iLTM is an integrated Large Tabular Model that unifies tree-derived embeddings, dimensionality-agnostic representations, a meta-trained hypernetwork, multilayer perceptrons, and retrieval within a single architecture.
arXiv Detail & Related papers (2025-11-20T00:20:16Z) - Benchmarking Foundation Models for Mitotic Figure Classification [0.37334049820361814]
Self-supervised learning techniques have enabled the use of vast amounts of unlabeled data to train large-scale neural networks. In this work, we investigate the use of foundation models for mitotic figure classification. We compare all models against end-to-end-trained baselines, both CNNs and Vision Transformers.
arXiv Detail & Related papers (2025-08-06T13:30:40Z) - Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition [63.55828203989405]
We introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures. We propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training.
arXiv Detail & Related papers (2025-06-26T11:53:59Z) - Revisiting Nearest Neighbor for Tabular Data: A Deep Tabular Baseline Two Decades Later [76.66498833720411]
We introduce a differentiable version of $K$-nearest neighbors (KNN), originally designed to learn a linear projection that captures semantic similarities between instances. Surprisingly, our implementation of NCA using SGD and without dimensionality reduction already achieves decent performance on tabular data. We conclude our paper by analyzing the factors behind these improvements, including loss functions, prediction strategies, and deep architectures.
arXiv Detail & Related papers (2024-07-03T16:38:57Z) - A Closer Look at Deep Learning Methods on Tabular Datasets [78.61845513154502]
We present an extensive study on TALENT, a collection of 300+ datasets spanning a broad range of sizes. Our evaluation shows that ensembling benefits both tree-based and neural approaches.
arXiv Detail & Related papers (2024-07-01T04:24:07Z) - Is Deep Learning finally better than Decision Trees on Tabular Data? [19.657605376506357]
Tabular data is a ubiquitous data modality due to its versatility and ease of use in many real-world applications. Recent studies on tabular data offer a unique perspective on the limitations of neural networks in this domain. Our study categorizes ten state-of-the-art models based on their underlying learning paradigm.
arXiv Detail & Related papers (2024-02-06T12:59:02Z) - Minimally Supervised Learning using Topological Projections in
Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs). Our proposed method first trains SOMs on unlabeled data, and then a minimal number of available labeled data points are assigned to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z) - SSL-SoilNet: A Hybrid Transformer-based Framework with Self-Supervised Learning for Large-scale Soil Organic Carbon Prediction [2.554658234030785]
This study introduces a novel approach that aims to learn the geographical link between multimodal features via self-supervised contrastive learning.
The proposed approach has undergone rigorous testing on two distinct large-scale datasets.
arXiv Detail & Related papers (2023-08-07T13:44:44Z) - HSurf-Net: Normal Estimation for 3D Point Clouds by Learning Hyper Surfaces [54.77683371400133]
We propose a novel normal estimation method called HSurf-Net, which can accurately predict normals from point clouds with noise and density variations.
Experimental results show that our HSurf-Net achieves the state-of-the-art performance on the synthetic shape dataset.
arXiv Detail & Related papers (2022-10-13T16:39:53Z) - Contrastive Neighborhood Alignment [81.65103777329874]
We present Contrastive Neighborhood Alignment (CNA), a manifold learning approach to maintain the topology of learned features.
The target model aims to mimic the local structure of the source representation space using a contrastive loss.
CNA is illustrated in three scenarios: manifold learning, where the model maintains the local topology of the original data in a dimension-reduced space; model distillation, where a small student model is trained to mimic a larger teacher; and legacy model update, where an older model is replaced by a more powerful one.
arXiv Detail & Related papers (2022-01-06T04:58:31Z) - Surface Warping Incorporating Machine Learning Assisted Domain Likelihood Estimation: A New Paradigm in Mine Geology Modelling and Automation [68.8204255655161]
A Bayesian warping technique has been proposed to reshape modeled surfaces based on geochemical and spatial constraints imposed by newly acquired blasthole data.
This paper focuses on incorporating machine learning in this warping framework to make the likelihood generalizable.
Its foundation is laid by a Bayesian computation in which the geological domain likelihood given the chemistry, p(g|c), plays a role similar to p(y(c)|g).
arXiv Detail & Related papers (2021-02-15T10:37:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.