Related papers: Vamsa: Automated Provenance Tracking in Data Science Scripts

Vamsa: Automated Provenance Tracking in Data Science Scripts

URL: http://arxiv.org/abs/2001.01861v2
Date: Thu, 30 Jul 2020 16:58:22 GMT
Title: Vamsa: Automated Provenance Tracking in Data Science Scripts
Authors: Mohammad Hossein Namaki, Avrilia Floratou, Fotis Psallidas, Subru Krishnan, Ashvin Agrawal, Yinghui Wu, Yiwen Zhu and Markus Weimer
Abstract summary: We introduce the ML provenance tracking problem. We discuss the challenges in capturing such information in the context of Python. We present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users' code.
Score: 17.53546311589593
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: There has recently been a lot of ongoing research in the areas of fairness, bias and explainability of machine learning (ML) models due to the self-evident or regulatory requirements of various ML applications. We make the following observation: All of these approaches require a robust understanding of the relationship between ML models and the data used to train them. In this work, we introduce the ML provenance tracking problem: the fundamental idea is to automatically track which columns in a dataset have been used to derive the features/labels of an ML model. We discuss the challenges in capturing such information in the context of Python, the most common language used by data scientists. We then present Vamsa, a modular system that extracts provenance from Python scripts without requiring any changes to the users' code. Using 26K real data science scripts, we verify the effectiveness of Vamsa in terms of coverage, and performance. We also evaluate Vamsa's accuracy on a smaller subset of manually labeled data. Our analysis shows that Vamsa's precision and recall range from 90.4% to 99.1% and its latency is in the order of milliseconds for average size scripts. Drawing from our experience in deploying ML models in production, we also present an example in which Vamsa helps automatically identify models that are affected by data corruption issues.

Related papers

Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
CatShift is a label-only dataset-inference framework.<n>It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
arXiv Detail & Related papers (2025-06-06T13:02:59Z)
LeakageDetector: An Open Source Data Leakage Analysis Tool in Machine Learning Pipelines [3.5453450990441238]
Our work seeks to enable Machine Learning (ML) engineers to write better code by helping them find and fix instances of Data Leakage in their models. ML developers must carefully separate their data into training, evaluation, and test sets to avoid introducing Data Leakage into their code. In this paper, we develop LEAKAGEDETECTOR, a Python plugin that identifies instances of Data Leakage in ML code and provides suggestions on how to remove the leakage.
arXiv Detail & Related papers (2025-03-18T20:53:44Z)
A General Framework for Data-Use Auditing of ML Models [47.369572284751285]
We propose a general method to audit an ML model for the use of a data-owner's data in training. We show the effectiveness of our proposed framework by applying it to audit data use in two types of ML models.
arXiv Detail & Related papers (2024-07-21T09:32:34Z)
Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z)
Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users. We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set. We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
Accelerated Cloud for Artificial Intelligence (ACAI) [24.40451195277244]
We propose an end-to-end cloud-based machine learning platform, Accelerated Cloud for AI (ACAI) ACAI enables cloud-based storage of indexed, labeled, and searchable data, as well as automatic resource provisioning, job scheduling, and experiment tracking. We show that our auto-provisioner produces a 1.7x speed-up and 39% cost reduction, and our system reduces experiment time for ML scientists by 20% on typical ML use cases.
arXiv Detail & Related papers (2024-01-30T07:09:48Z)
Towards Automatic Translation of Machine Learning Visual Insights to Analytical Assertions [23.535630175567146]
We present our vision for developing an automated tool capable of translating visual properties observed in Machine Learning (ML) visualisations into Python assertions. The tool aims to streamline the process of manually verifying these visualisations in the ML development cycle, which is critical as real-world data and assumptions often change post-deployment.
arXiv Detail & Related papers (2024-01-15T14:11:59Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations. We study the impact of labeled data through in-context learning and finetuning. We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems. We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z)
Tribuo: Machine Learning with Provenance in Java [0.0]
We introduce Tribuo, a Java ML library that integrates training, type-safety, runtime checking, and automatic recording into a single framework. All Tribuo's models and evaluations record the full processing pipeline for input data, along with the training algorithms.
arXiv Detail & Related papers (2021-10-06T19:10:50Z)
FreaAI: Automated extraction of data slices to test machine learning models [2.475112368179548]
We show the feasibility of automatically extracting feature models that result in explainable data slices over which the ML solution under-performs. Our novel technique, IBM FreaAI aka FreaAI, extracts such slices from structured ML test data or any other labeled data.
arXiv Detail & Related papers (2021-08-12T09:21:16Z)
Did the Model Change? Efficiently Assessing Machine Learning API Shifts [24.342984907651505]
Machine learning (ML) prediction APIs are increasingly widely used. They can change over time due to model updates or retraining. It is often not clear to the user if and how the ML model has changed.
arXiv Detail & Related papers (2021-07-29T17:41:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.