Learning from Sparse Datasets: Predicting Concrete's Strength by Machine Learning
- URL: http://arxiv.org/abs/2004.14407v1
- Date: Wed, 29 Apr 2020 18:06:07 GMT
- Title: Learning from Sparse Datasets: Predicting Concrete's Strength by Machine Learning
- Authors: Boya Ouyang, Yuhai Li, Yu Song, Feishu Wu, Huizi Yu, Yongzhe Wang, Mathieu Bauchy, and Gaurav Sant
- Abstract summary: Data-driven machine learning is promising for handling the complex, non-linear, non-additive relationship between concrete mixture proportions and strength.
Here, we compare the ability of select ML algorithms to "learn" how to reliably predict concrete strength as a function of the size of the dataset.
- Score: 2.350486334305103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite enormous efforts over the last decades to establish the relationship
between concrete proportioning and strength, a robust knowledge-based model for
accurate concrete strength predictions is still lacking. As an alternative to
physical or chemical-based models, data-driven machine learning (ML) methods
offer a new solution to this problem. Although this approach is promising for
handling the complex, non-linear, non-additive relationship between concrete
mixture proportions and strength, a major limitation of ML lies in the fact
that large datasets are needed for model training. This is a concern as
reliable, consistent strength data is rather limited, especially for realistic
industrial concretes. Here, based on the analysis of a large dataset (>10,000
observations) of measured compressive strengths from industrially-produced
concretes, we compare the ability of select ML algorithms to "learn" how to
reliably predict concrete strength as a function of the size of the dataset.
Based on these results, we discuss the competition between how accurate a given
model can eventually be (when trained on a large dataset) and how much data is
actually required to train this model.
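The experiment described in the abstract, comparing how well different ML algorithms learn as a function of training-set size, can be sketched as a simple learning-curve study. The sketch below is not the authors' code: it uses synthetic regression data as a stand-in for their proprietary concrete-strength dataset, and the two models and size grid are illustrative choices.

```python
# Minimal learning-curve sketch (assumption: synthetic data stands in for
# the >10,000-observation concrete-strength dataset from the paper).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10000, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train each model on progressively larger subsets and score on a fixed test set.
for n in (100, 1000, 8000):
    for model in (LinearRegression(),
                  RandomForestRegressor(n_estimators=50, random_state=0)):
        model.fit(X_train[:n], y_train[:n])
        score = r2_score(y_test, model.predict(X_test))
        print(f"n={n:5d}  {type(model).__name__:22s}  R^2={score:.3f}")
```

Plotting test accuracy against `n` for each model makes visible the trade-off the paper discusses: a flexible model may reach higher eventual accuracy but needs more data to get there.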
Related papers
- Predicting Large Language Model Capabilities on Closed-Book QA Tasks Using Only Information Available Prior to Training [51.60874286674908]
We focus on predicting performance on Closed-book Question Answering (CBQA) tasks, which are closely tied to pre-training data and knowledge retention.
We address three major challenges: 1) mastering the entire pre-training process, especially data construction; 2) evaluating a model's knowledge retention; and 3) predicting task-specific knowledge retention using only information available prior to training.
We introduce the SMI metric, an information-theoretic measure that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
- Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data.
We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective.
We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by roughly 200x.
arXiv Detail & Related papers (2025-01-20T21:10:22Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Extrapolative ML Models for Copolymers [1.901715290314837]
Machine learning models have been progressively used for predicting materials properties.
These models are inherently interpolative, and their efficacy for searching candidates outside the known range of a material property is unresolved.
Here, we determine the relationship between the extrapolation ability of an ML model, the size and range of its training dataset, and its learning approach.
arXiv Detail & Related papers (2024-09-15T11:02:01Z)
- PUMA: margin-based data pruning [51.12154122266251]
We focus on data pruning, where some training samples are removed based on their distance to the model's classification boundary (i.e., margin).
We propose PUMA, a new data pruning strategy that computes the margin using DeepFool.
We show that PUMA can be used on top of the current state-of-the-art methodology in robustness, and it is able to significantly improve the model performance unlike the existing data pruning strategies.
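The core idea, pruning by margin, can be sketched without PUMA's machinery. The example below is a hedged illustration, not the paper's method: it substitutes a plain logistic-regression margin for the DeepFool-based margin the paper computes, and the 20% pruning fraction is an arbitrary choice.

```python
# Margin-based data pruning sketch (assumption: a linear-model margin stands
# in for the DeepFool margin used by PUMA).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Margin proxy: each sample's distance to the linear decision boundary.
margins = np.abs(clf.decision_function(X)) / np.linalg.norm(clf.coef_)

# Prune the 20% of samples closest to the boundary (smallest margins).
keep = margins > np.quantile(margins, 0.2)
X_pruned, y_pruned = X[keep], y[keep]
print(f"kept {keep.sum()} of {len(X)} samples")
```

Whether to drop low-margin (hard) or high-margin (easy) samples depends on the goal; the robustness setting the paper targets makes the choice non-trivial.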
arXiv Detail & Related papers (2024-05-10T08:02:20Z)
- An Investigation of Smart Contract for Collaborative Machine Learning Model Training [3.5679973993372642]
Collaborative machine learning (CML) has penetrated various fields in the era of big data.
As the training of ML models requires a massive amount of good quality data, it is necessary to eliminate concerns about data privacy.
Based on blockchain, smart contracts enable automatic execution of data preserving and validation.
arXiv Detail & Related papers (2022-09-12T04:25:01Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
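The supervised-learning framing can be sketched as follows. This is an illustrative toy, not the paper's estimator: the features (a tiny frequency profile of the sample) and the synthetic column generator are assumptions chosen for brevity.

```python
# Learned NDV-estimator sketch (assumption: f1/f2/f3 sample-frequency counts
# as features; the paper's actual feature set and model differ).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def sample_features(sample):
    # freq[i] = number of distinct values appearing exactly i times in the sample
    _, value_counts = np.unique(sample, return_counts=True)
    freq = np.bincount(value_counts, minlength=4)
    return freq[1:4]  # f1, f2, f3

# Build training data: synthetic columns with known NDV, observed via samples.
X_train, y_train = [], []
for _ in range(500):
    ndv = rng.integers(10, 1000)
    column = rng.integers(0, ndv, size=5000)      # column with up to ndv distinct values
    X_train.append(sample_features(rng.choice(column, size=200)))
    y_train.append(len(np.unique(column)))        # true NDV label

est = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Use the learned estimator on a new column's sample.
col = rng.integers(0, 300, size=5000)
pred = est.predict([sample_features(rng.choice(col, size=200))])[0]
print(f"true NDV {len(np.unique(col))}, predicted ~{pred:.0f}")
```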
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Deep Learning with Multiple Data Set: A Weighted Goal Programming Approach [2.7393821783237184]
Large-scale data analysis is growing at an exponential rate as data proliferates in our societies.
Deep Learning models require plenty of resources, and distributed training is needed.
This paper presents a Multicriteria approach for distributed learning.
arXiv Detail & Related papers (2021-11-27T07:10:25Z)
- PyHard: a novel tool for generating hardness embeddings to support data-centric analysis [0.38233569758620045]
PyHard produces a hardness embedding of a dataset relating the predictive performance of multiple ML models.
The user can interact with this embedding in multiple ways to obtain useful insights about data and algorithmic performance.
We show in a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models.
arXiv Detail & Related papers (2021-09-29T14:08:26Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.