Using Small Language Models to Reverse-Engineer Machine Learning Pipelines Structures
- URL: http://arxiv.org/abs/2601.03988v1
- Date: Wed, 07 Jan 2026 15:00:22 GMT
- Title: Using Small Language Models to Reverse-Engineer Machine Learning Pipelines Structures
- Authors: Nicolas Lacroix, Mireille Blay-Fornarino, Sébastien Mosser, Frederic Precioso
- Abstract summary: Existing approaches either depend on non-scalable, manual labeling, or on ML classifiers that do not properly support the diversity of the domain. We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations.
- Score: 0.38180404292108383
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Background: Extracting the stages that structure Machine Learning (ML) pipelines from source code is key for gaining a deeper understanding of data science practices. However, the diversity caused by the constant evolution of the ML ecosystem (e.g., algorithms, libraries, datasets) makes this task challenging. Existing approaches either depend on non-scalable, manual labeling, or on ML classifiers that do not properly support the diversity of the domain. These limitations highlight the need for more flexible and reliable solutions. Objective: We evaluate whether Small Language Models (SLMs) can leverage their code understanding and classification abilities to address these limitations, and subsequently how they can advance our understanding of data science practices. Method: We conduct a confirmatory study based on two reference works selected for their relevance regarding current state-of-the-art's limitations. First, we compare several SLMs using Cochran's Q test. The best-performing model is then evaluated against the reference studies using two distinct McNemar's tests. We further analyze how variations in taxonomy definitions affect performance through an additional Cochran's Q test. Finally, a goodness-of-fit analysis is conducted using Pearson's chi-squared tests to compare our insights on data science practices with those from prior studies.
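The tests named in the Method part of the abstract operate on paired binary outcomes (whether each model labeled a given item correctly). The following is a minimal sketch of two of them, Cochran's Q and McNemar's test, written with only the Python standard library. The function names and the `toy` data are hypothetical illustrations, not the authors' code or results, and the McNemar variant shown omits the continuity correction, which the paper may or may not use.

```python
from math import erfc, sqrt


def cochran_q(outcomes):
    """Cochran's Q statistic.

    `outcomes` is a list of rows; each row holds the 0/1 results
    (1 = correct) of k classifiers on the same item.
    """
    k = len(outcomes[0])
    col_totals = [sum(row[j] for row in outcomes) for j in range(k)]
    row_totals = [sum(row) for row in outcomes]
    n = sum(row_totals)  # total number of correct answers
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - n * n)
    denominator = k * n - sum(r * r for r in row_totals)
    # Under the null hypothesis, Q follows a chi-squared law with k - 1 df.
    return numerator / denominator


def mcnemar(a_correct, b_correct):
    """McNemar's chi-squared statistic (no continuity correction) and p-value."""
    # Discordant pairs: items where exactly one of the two classifiers is right.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    if b + c == 0:
        return 0.0, 1.0  # no disagreement, nothing to test
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


# Hypothetical toy data: 6 items, 3 classifiers (columns).
toy = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
]

q_stat = cochran_q(toy)
first = [row[0] for row in toy]
third = [row[2] for row in toy]
m_stat, m_p = mcnemar(first, third)
```

In this kind of workflow, Cochran's Q checks whether several models differ at all on the same items, and McNemar's test then compares one pair of models. For real analyses, an established statistics library is a safer choice than hand-written formulas.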
Related papers
- Can We Classify Flaky Tests Using Only Test Code? An LLM-Based Empirical Study [40.93176986225226]
Flaky tests yield inconsistent results when they are repeatedly executed on the same code revision. Previous work evaluated approaches to train machine learning models to classify flaky tests based on identifiers in the test code.
arXiv Detail & Related papers (2026-02-05T09:15:09Z) - CAuSE: Decoding Multimodal Classifiers using Faithful Natural Language Explanation [46.9286703847151]
We propose CAuSE (Causal Abstraction under Simulated Explanations), a novel framework to generate faithful NLEs for any pretrained multimodal classifier. We demonstrate that CAuSE generalizes across datasets and models through extensive empirical evaluations. We further validate this through a redesigned metric for measuring causal faithfulness in multimodal settings.
arXiv Detail & Related papers (2025-12-07T12:15:21Z) - Utilizing Large Language Models for Machine Learning Explainability [37.31918138232927]
This study explores the explainability capabilities of large language models (LLMs) when employed to autonomously generate machine learning (ML) solutions. Three state-of-the-art LLMs are prompted to design training pipelines for four common classifiers: Random Forest, XGBoost, Multilayer Perceptron, and Long Short-Term Memory networks. The generated models are evaluated in terms of predictive performance (recall, precision, and F1-score) and explainability using SHAP (SHapley Additive exPlanations).
arXiv Detail & Related papers (2025-10-08T11:46:23Z) - SciML Agents: Write the Solver, Not the Solution [69.5021018644143]
We introduce two new datasets: a diagnostic dataset of adversarial "misleading" problems; and a large-scale benchmark of 1,000 diverse ODE tasks. We evaluate open- and closed-source LLM models along two axes: (i) unguided versus guided prompting with domain-specific knowledge; and (ii) off-the-shelf versus fine-tuned variants. Preliminary results indicate that careful prompting and fine-tuning can yield a specialized LLM agent capable of reliably solving simple ODE problems.
arXiv Detail & Related papers (2025-09-12T02:53:57Z) - Semantic Source Code Segmentation using Small and Large Language Models [2.5748316361772963]
This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. Our results show that context-based line-by-line analysis is superior to range-based segmentation.
arXiv Detail & Related papers (2025-07-11T19:49:59Z) - An Analysis of LLM Fine-Tuning and Few-Shot Learning for Flaky Test Detection and Classification [1.9336815376402723]
Flaky tests exhibit non-deterministic behavior during execution. Flaky test detection and classification is challenging due to the variability in test behavior.
arXiv Detail & Related papers (2025-02-04T20:54:51Z) - A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification [51.35500308126506]
Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels.
We study how classification-based evaluation protocols for SSL correlate and how well they predict downstream performance on different dataset types.
arXiv Detail & Related papers (2024-07-16T23:17:36Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Gone Fishing: Neural Active Learning with Fisher Embeddings [55.08537975896764]
There is an increasing need for active learning algorithms that are compatible with deep neural networks.
This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks.
arXiv Detail & Related papers (2021-06-17T17:26:31Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z) - Transfer Learning or Self-supervised Learning? A Tale of Two Pretraining Paradigms [36.04356511882304]
Self-supervised learning (SSL) has demonstrated promising results on a wide range of applications.
There has not been a clear understanding of which properties of data and tasks make one approach outperform the other.
arXiv Detail & Related papers (2020-06-19T05:21:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.