Related papers: Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle

URL: http://arxiv.org/abs/2408.02498v2
Date: Fri, 15 Nov 2024 20:57:40 GMT
Title: Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle
Authors: Rolando Garcia, Pragya Kallanagoudar, Chithra Anand, Sarah E. Chasins, Joseph M. Hellerstein, Erin Michelle Turner Kerrison, Aditya G. Parameswaran
Abstract summary: We present techniques to harvest and query arbitrary metadata from machine learning pipelines. We show how hindsight logging allows log statements to be added and executed post-hoc. This is done in a "metadata later" style, off the critical path of agile development.
Score: 9.424552130799661
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper we present techniques to incrementally harvest and query arbitrary metadata from machine learning pipelines, without disrupting agile practices. We center our approach on the developer-favored technique for generating metadata -- log statements -- leveraging the fact that logging creates context. We show how hindsight logging allows such statements to be added and executed post-hoc, without requiring developer foresight. Relational views of incomplete metadata can be queried to dynamically materialize new metadata in bulk and on demand across multiple versions of workflows. This is done in a "metadata later" style, off the critical path of agile development. We realize these ideas in a system called FlorDB and demonstrate how the data context framework covers a range of both ad-hoc metadata as well as special cases treated today by bespoke feature stores and model repositories. Through a usage scenario -- including both ML and human feedback -- we illustrate how the component techniques come together to resolve classic software engineering trade-offs between agility and discipline.
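
The "relational view of incomplete metadata" idea in the abstract can be illustrated with a minimal sketch. This is not FlorDB's API or storage format: the record layout, the `pivot` and `missing_cells` helpers, and all values are hypothetical, chosen only to show how log records from several workflow versions pivot into a table whose empty cells mark metadata that hindsight logging could fill in later.

```python
from collections import defaultdict

# Hypothetical log records in long format: (version, name, value).
# Assumption for illustration only; not FlorDB's actual schema.
records = [
    ("v1", "lr", 0.01),
    ("v1", "acc", 0.81),
    ("v2", "lr", 0.001),
    ("v2", "acc", 0.86),
    ("v2", "recall", 0.79),  # this log statement was added only in v2
]


def pivot(records):
    """Pivot long-format log records into one row per version."""
    columns = sorted({name for _, name, _ in records})
    rows = defaultdict(dict)
    for version, name, value in records:
        rows[version][name] = value
    # Names never logged in a version appear as None in that row.
    return [
        {"version": v, **{c: rows[v].get(c) for c in columns}}
        for v in sorted(rows)
    ]


def missing_cells(view):
    """List (version, column) pairs that have no logged value."""
    return [
        (row["version"], col)
        for row in view
        for col, val in row.items()
        if val is None
    ]


view = pivot(records)
gaps = missing_cells(view)
print(view)
print(gaps)
```

In this sketch the gap list plays the role of a to-do list: each missing cell names a version that would have to be replayed with the newer log statement in place to materialize the value.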

Related papers

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
Event Extraction in Large Language Model [99.94321497574805]
We argue that EE should be viewed as a system component that provides a cognitive scaffold for LLM-centered solutions. This survey covers EE in text and multimodal settings, organizing tasks and taxonomy, tracing method evolution from rule-based and neural models to instruction-driven and generative frameworks.
arXiv Detail & Related papers (2025-12-22T16:22:14Z)
Ontology-Driven Model-to-Model Transformation of Workflow Specifications [0.8921166277011348]
Proprietary languages such as Smart Forms & Smart Flow hamper interoperability and reuse because they lock process knowledge into closed formats. We introduce an ontology-driven model-to-model pipeline that supports domain-specific definitions to Business Process Model and Notation. We instantiated the pipeline for Instituto Superior Técnico (IST)'s Smart Forms & Smart Flow and implemented a converter that produces standard-compliant BPMN diagrams.
arXiv Detail & Related papers (2025-11-17T18:16:19Z)
Language Modeling with Learned Meta-Tokens [15.860245999620409]
This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism to guide LMs to use these tokens. We find that data-efficient language model pre-training on fewer than 100B tokens utilizing meta-tokens achieves strong performance on these tasks after fine-tuning.
arXiv Detail & Related papers (2025-09-18T17:38:48Z)
Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z)
Bootstrap Your Own Context Length [74.61148597039248]
We introduce a bootstrapping approach to train long-context language models by exploiting their short-context capabilities only. The proposed data synthesis workflow requires only a short-context language model, a text retriever, and a document collection. We conduct experiments with the open-source Llama-3 family of models and demonstrate that our method can successfully extend the context length to up to 1M tokens.
arXiv Detail & Related papers (2024-12-25T10:08:54Z)
The Compressor-Retriever Architecture for Language Model OS [20.56093501980724]
This paper explores the concept of using a language model as the core component of an operating system (OS). A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions. We introduce compressor-retriever, a model-agnostic architecture designed for life-long context management.
arXiv Detail & Related papers (2024-09-02T23:28:15Z)
An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources. We propose a data processing framework that integrates a Processing Module and an Analyzing Module. The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks. Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments. Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
A Topical Approach to Capturing Customer Insight In Social Media [0.0]
This research addresses the challenge of fully unsupervised topic extraction in noisy, Big Data contexts. We present three approaches we built on the Variational Autoencoder framework. We show that our models achieve equal or better performance than state-of-the-art methods.
arXiv Detail & Related papers (2023-07-14T11:15:28Z)
Pathway: a fast and flexible unified stream data processing framework for analytical and Machine Learning applications [7.850979932441607]
Pathway is a new unified data processing framework that can run workloads on both bounded and unbounded data streams. We describe the system and present benchmarking results which demonstrate its capabilities in both batch and streaming contexts.
arXiv Detail & Related papers (2023-07-12T08:27:37Z)
Learning to Learn from APIs: Black-Box Data-Free Meta-Learning [95.41441357931397]
Data-free meta-learning (DFML) aims to enable efficient learning of new tasks by meta-learning from a collection of pre-trained models without access to the training data. Existing DFML work can only meta-learn from (i) white-box and (ii) small-scale pre-trained models. We propose a Bi-level Data-free Meta Knowledge Distillation (BiDf-MKD) framework to transfer more general meta knowledge from a collection of black-box APIs to one single model.
arXiv Detail & Related papers (2023-05-28T18:00:12Z)
Metadata Representations for Queryable ML Model Zoos [73.24799582702326]
Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata describing properties of the models. The metadata is currently not standardised; its expressivity is limited; and there is no way to store and query it. In this paper, we advocate for standardized ML model metadata representation and management, proposing a toolkit to help practitioners manage and query that metadata.
arXiv Detail & Related papers (2022-07-19T15:04:14Z)
Scanflow: A multi-graph framework for Machine Learning workflow management, supervision, and debugging [0.0]
We propose a novel containerized directed graph framework to support end-to-end Machine Learning workflow management. The framework allows defining and deploying ML in containers, tracking their metadata, checking their behavior in production, and improving the models by using both learned and human-provided knowledge.
arXiv Detail & Related papers (2021-11-04T17:01:12Z)
Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time. The results show that off-the-shelf AutoML tools can provide satisfactory results but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain the predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems. We generate document representations that capture both text and metadata artifacts in a task manner. Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
Petri Nets with Parameterised Data: Modelling and Verification (Extended Version) [67.99023219822564]
We introduce and study an extension of coloured Petri nets, called catalog-nets, providing two key features to capture this type of processes. We show that fresh-value injection is a particularly complex feature to handle, and discuss strategies to tame it.
arXiv Detail & Related papers (2020-06-11T17:26:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.