Learning from Sparse Datasets: Predicting Concrete's Strength by Machine Learning
- URL: http://arxiv.org/abs/2004.14407v1
- Date: Wed, 29 Apr 2020 18:06:07 GMT
- Title: Learning from Sparse Datasets: Predicting Concrete's Strength by Machine Learning
- Authors: Boya Ouyang, Yuhai Li, Yu Song, Feishu Wu, Huizi Yu, Yongzhe Wang, Mathieu Bauchy, and Gaurav Sant
- Abstract summary: Data-driven machine learning is promising for handling the complex, non-linear, non-additive relationship between concrete mixture proportions and strength.
Here, we compare the ability of select ML algorithms to "learn" how to reliably predict concrete strength as a function of the size of the dataset.
- Score: 2.350486334305103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite enormous efforts over the last decades to establish the relationship
between concrete proportioning and strength, a robust knowledge-based model for
accurate concrete strength predictions is still lacking. As an alternative to
physical or chemical-based models, data-driven machine learning (ML) methods
offer a new solution to this problem. Although this approach is promising for
handling the complex, non-linear, non-additive relationship between concrete
mixture proportions and strength, a major limitation of ML lies in the fact
that large datasets are needed for model training. This is a concern as
reliable, consistent strength data is rather limited, especially for realistic
industrial concretes. Here, based on the analysis of a large dataset (>10,000
observations) of measured compressive strengths from industrially-produced
concretes, we compare the ability of select ML algorithms to "learn" how to
reliably predict concrete strength as a function of the size of the dataset.
Based on these results, we discuss the competition between how accurate a given
model can eventually be (when trained on a large dataset) and how much data is
actually required to train this model.
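The experiment described in the abstract, comparing how well different ML algorithms learn as a function of training-set size, can be sketched as a simple learning-curve study. The sketch below is not the authors' code: it uses synthetic regression data as a stand-in for their proprietary concrete-strength dataset, and the two models and size grid are illustrative choices.

```python
# Minimal learning-curve sketch (assumption: synthetic data stands in for
# the >10,000-observation concrete-strength dataset from the paper).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10000, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train each model on progressively larger subsets and score on a fixed test set.
for n in (100, 1000, 8000):
    for model in (LinearRegression(),
                  RandomForestRegressor(n_estimators=50, random_state=0)):
        model.fit(X_train[:n], y_train[:n])
        score = r2_score(y_test, model.predict(X_test))
        print(f"n={n:5d}  {type(model).__name__:22s}  R^2={score:.3f}")
```

Plotting test accuracy against `n` for each model makes visible the trade-off the paper discusses: a flexible model may reach higher eventual accuracy but needs more data to get there.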
Related papers
- Predicting Large Language Model Capabilities on Closed-Book QA Tasks Using Only Information Available Prior to Training [51.60874286674908]
We focus on predicting performance on Closed-book Question Answering (CBQA) tasks, which are closely tied to pre-training data and knowledge retention.
We address three major challenges: 1) mastering the entire pre-training process, especially data construction; 2) evaluating a model's knowledge retention; and 3) predicting task-specific knowledge retention using only information available prior to training.
We introduce the SMI metric, an information-theoretic measure that quantifies the relationship between pre-training data, model size, and task-specific knowledge retention.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
- Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data.
We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective.
We propose two complementary approaches: UtiliMax, which extends token-based heuristics by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by roughly 200x.
arXiv Detail & Related papers (2025-01-20T21:10:22Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Extrapolative ML Models for Copolymers [1.901715290314837]
Machine learning models have been progressively used for predicting materials properties.
These models are inherently interpolative, and their efficacy for searching candidates outside the known range of a material property is unresolved.
Here, we determine the relationship between the extrapolation ability of an ML model, the size and range of its training dataset, and its learning approach.
arXiv Detail & Related papers (2024-09-15T11:02:01Z)
- PUMA: margin-based data pruning [51.12154122266251]
We focus on data pruning, where some training samples are removed based on their distance to the model's classification boundary (i.e., margin).
We propose PUMA, a new data pruning strategy that computes the margin using DeepFool.
We show that PUMA can be used on top of the current state-of-the-art methodology in robustness, and it is able to significantly improve the model performance unlike the existing data pruning strategies.
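The core idea, pruning by margin, can be sketched without PUMA's machinery. The example below is a hedged illustration, not the paper's method: it substitutes a plain logistic-regression margin for the DeepFool-based margin the paper computes, and the 20% pruning fraction is an arbitrary choice.

```python
# Margin-based data pruning sketch (assumption: a linear-model margin stands
# in for the DeepFool margin used by PUMA).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Margin proxy: each sample's distance to the linear decision boundary.
margins = np.abs(clf.decision_function(X)) / np.linalg.norm(clf.coef_)

# Prune the 20% of samples closest to the boundary (smallest margins).
keep = margins > np.quantile(margins, 0.2)
X_pruned, y_pruned = X[keep], y[keep]
print(f"kept {keep.sum()} of {len(X)} samples")
```

Whether to drop low-margin (hard) or high-margin (easy) samples depends on the goal; the robustness setting the paper targets makes the choice non-trivial.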
arXiv Detail & Related papers (2024-05-10T08:02:20Z)
- An Investigation of Smart Contract for Collaborative Machine Learning Model Training [3.5679973993372642]
Collaborative machine learning (CML) has penetrated various fields in the era of big data.
As the training of ML models requires a massive amount of good quality data, it is necessary to eliminate concerns about data privacy.
Based on blockchain, smart contracts enable automatic execution of data preserving and validation.
arXiv Detail & Related papers (2022-09-12T04:25:01Z)
- Learning to be a Statistician: Learned Estimator for Number of Distinct Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
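The supervised-learning framing can be sketched as follows. This is an illustrative toy, not the paper's estimator: the features (a tiny frequency profile of the sample) and the synthetic column generator are assumptions chosen for brevity.

```python
# Learned NDV-estimator sketch (assumption: f1/f2/f3 sample-frequency counts
# as features; the paper's actual feature set and model differ).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def sample_features(sample):
    # freq[i] = number of distinct values appearing exactly i times in the sample
    _, value_counts = np.unique(sample, return_counts=True)
    freq = np.bincount(value_counts, minlength=4)
    return freq[1:4]  # f1, f2, f3

# Build training data: synthetic columns with known NDV, observed via samples.
X_train, y_train = [], []
for _ in range(500):
    ndv = rng.integers(10, 1000)
    column = rng.integers(0, ndv, size=5000)      # column with up to ndv distinct values
    X_train.append(sample_features(rng.choice(column, size=200)))
    y_train.append(len(np.unique(column)))        # true NDV label

est = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Use the learned estimator on a new column's sample.
col = rng.integers(0, 300, size=5000)
pred = est.predict([sample_features(rng.choice(col, size=200))])[0]
print(f"true NDV {len(np.unique(col))}, predicted ~{pred:.0f}")
```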
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
- Deep Learning with Multiple Data Set: A Weighted Goal Programming Approach [2.7393821783237184]
Large-scale data analysis is growing at an exponential rate as data proliferates in our societies.
Deep Learning models require plenty of resources, and distributed training is needed.
This paper presents a Multicriteria approach for distributed learning.
arXiv Detail & Related papers (2021-11-27T07:10:25Z)
- PyHard: a novel tool for generating hardness embeddings to support data-centric analysis [0.38233569758620045]
PyHard produces a hardness embedding of a dataset relating the predictive performance of multiple ML models.
The user can interact with this embedding in multiple ways to obtain useful insights about data and algorithmic performance.
We show in a COVID prognosis dataset how this analysis supported the identification of pockets of hard observations that challenge ML models.
arXiv Detail & Related papers (2021-09-29T14:08:26Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.