Related papers: DataOps-driven CI/CD for analytics repositories

URL: http://arxiv.org/abs/2511.12277v1
Date: Sat, 15 Nov 2025 16:09:47 GMT
Title: DataOps-driven CI/CD for analytics repositories
Authors: Dmytro Valiaiev
Abstract summary: This perspective proposes a qualitative design for a DataOps-aligned validation framework. The framework consists of five stages: Lint, Optimize, Parse, Validate, and Observe. A Requirements Traceability Matrix (RTM) demonstrates how each high-level control is enforced by concrete pipeline checks.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The proliferation of SQL for data processing has often occurred without the rigor of traditional software development, leading to siloed efforts, logic replication, and increased risk. This ad-hoc approach hampers data governance and makes validation nearly impossible. To address these challenges, organizations are adopting DataOps, a methodology that combines Agile, Lean, and DevOps principles and treats analytics pipelines as production systems. However, a standardized framework for implementing DataOps is lacking. This perspective proposes a qualitative design for a DataOps-aligned validation framework. It introduces a DataOps Controls Scorecard, derived from a multivocal literature review, which distills key concepts into twelve testable controls. These controls are then mapped to a modular, extensible CI/CD pipeline framework designed to govern a single source of truth (SOT) SQL repository. The framework consists of five stages: Lint, Optimize, Parse, Validate, and Observe, each containing specific, automated checks. A Requirements Traceability Matrix (RTM) demonstrates how each high-level control is enforced by concrete pipeline checks, ensuring qualitative completeness. This approach provides a structured mechanism for enhancing data quality, governance, and collaboration, allowing teams to scale analytics development with transparency and control.
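
The staged pipeline and the traceability matrix described in the abstract can be illustrated with a minimal sketch. Only the five stage names come from the abstract. The control IDs (C1 to C3), the check names, the toy check logic, and the stop-at-first-failing-stage behavior are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Stage names as listed in the abstract, in pipeline order.
STAGES = ["Lint", "Optimize", "Parse", "Validate", "Observe"]


@dataclass
class Check:
    name: str                   # hypothetical check name
    stage: str                  # one of STAGES
    controls: List[str]         # hypothetical control IDs this check enforces
    run: Callable[[str], bool]  # returns True when the SQL text passes


def untraced_controls(controls: List[str], checks: List[Check]) -> List[str]:
    """RTM-style completeness test: controls that no check enforces."""
    covered = {c for chk in checks for c in chk.controls}
    return [c for c in controls if c not in covered]


def run_pipeline(sql: str, checks: List[Check]) -> Dict[str, List[str]]:
    """Run checks stage by stage and report failed check names per stage.

    Assumption: the pipeline stops after the first stage with a failure.
    """
    results: Dict[str, List[str]] = {}
    for stage in STAGES:
        failed = [c.name for c in checks if c.stage == stage and not c.run(sql)]
        results[stage] = failed
        if failed:
            break
    return results


# Toy checks, purely illustrative.
checks = [
    Check("no_select_star", "Lint", ["C1"],
          lambda sql: "select *" not in sql.lower()),
    Check("non_empty_statement", "Parse", ["C2"],
          lambda sql: bool(sql.strip())),
]

print(untraced_controls(["C1", "C2", "C3"], checks))
print(run_pipeline("SELECT * FROM t", checks))
```

In this sketch a control with no enforcing check shows up in the output of untraced_controls, which is one simple way to read the "qualitative completeness" that the RTM is meant to demonstrate.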

Related papers

LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks [4.6880826836662814]
We introduce LOGIGEN, a logic-driven framework that synthesizes verifiable training data. On $2$-Bench, LOGIGEN-32B(RL) achieves a 79.5% success rate, substantially outperforming the base model.
arXiv Detail & Related papers (2026-02-28T08:35:30Z)
Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis [23.501567675008264]
We present BAR- Mutation (Boundary-Aware Reliable NL2), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We employ Knowledge-Grounded Reasoning Synthesis to ensure interpretability.
arXiv Detail & Related papers (2026-01-15T11:55:01Z)
FABRIC: Framework for Agent-Based Realistic Intelligence Creation [3.940391073007047]
Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision.
arXiv Detail & Related papers (2025-10-20T18:20:22Z)
Analyzing and Internalizing Complex Policy Documents for LLM Agents [53.14898416858099]
Large Language Model (LLM)-based agentic systems rely on in-context policy documents encoding diverse business rules. This motivates developing internalization methods that embed policy documents into model priors while preserving performance. We introduce CC-Gen, an agentic benchmark generator with Controllable Complexity across four levels.
arXiv Detail & Related papers (2025-10-13T16:30:07Z)
CoDA: Agentic Systems for Collaborative Data Visualization [57.270599188947294]
Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection.
arXiv Detail & Related papers (2025-10-03T17:30:16Z)
Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation [55.47971671635531]
Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA). Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge. Existing systems primarily rely on unstructured documents, while largely overlooking relational databases.
arXiv Detail & Related papers (2025-09-30T22:19:44Z)
Query as Test: An Intelligent Driving Test and Data Storage Method for Integrated Cockpit-Vehicle-Road Scenarios [17.75264660582999]
Existing testing methods rely on data stacking, fail to cover all edge cases, and lack flexibility. "Query as Test" (QaT) shifts the focus from rigid, prescripted test cases to flexible, on-demand logical queries. "Extensible Scenarios Notations" (ESN) is a novel declarative data framework.
arXiv Detail & Related papers (2025-06-27T09:59:58Z)
TD-Suite: All Batteries Included Framework for Technical Debt Classification [5.669063174637433]
TD-Suite provides a seamless end-to-end pipeline, managing everything from initial data ingestion to model training. To ensure the generated models are robust and perform reliably on real-world, often imbalanced, datasets, TD-Suite incorporates critical training methodologies. The framework integrates tracking and reporting of carbon emissions associated with the computationally intensive model training process.
arXiv Detail & Related papers (2025-04-15T11:31:17Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Relational Action Bases: Formalization, Effective Safety Verification, and Invariants (Extended Version) [67.99023219822564]
We introduce the general framework of relational action bases (RABs). RABs generalize existing models by lifting both restrictions. We demonstrate the effectiveness of this approach on a benchmark of data-aware business processes.
arXiv Detail & Related papers (2022-08-12T17:03:50Z)
Soundness of Data-Aware Processes with Arithmetic Conditions [8.914271888521652]
Data Petri nets (DPNs) have gained increasing popularity thanks to their ability to balance simplicity with expressiveness. The interplay of data and control-flow makes checking the correctness of such models, specifically the well-known property of soundness, crucial and challenging. We provide a framework for assessing soundness of DPNs enriched with arithmetic data conditions.
arXiv Detail & Related papers (2022-03-28T14:46:10Z)
CoCoMoT: Conformance Checking of Multi-Perspective Processes via SMT (Extended Version) [62.96267257163426]
We introduce the CoCoMoT (Computing Conformance Modulo Theories) framework. First, we show how SAT-based encodings studied in the pure control-flow setting can be lifted to our data-aware case. Second, we introduce a novel preprocessing technique based on a notion of property-preserving clustering.
arXiv Detail & Related papers (2021-03-18T20:22:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.