DataOps-driven CI/CD for analytics repositories
- URL: http://arxiv.org/abs/2511.12277v1
- Date: Sat, 15 Nov 2025 16:09:47 GMT
- Title: DataOps-driven CI/CD for analytics repositories
- Authors: Dmytro Valiaiev
- Abstract summary: This perspective proposes a qualitative design for a DataOps-aligned validation framework. The framework consists of five stages: Lint, Optimize, Parse, Validate, and Observe. A Requirements Traceability Matrix (RTM) demonstrates how each high-level control is enforced by concrete pipeline checks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of SQL for data processing has often occurred without the rigor of traditional software development, leading to siloed efforts, logic replication, and increased risk. This ad-hoc approach hampers data governance and makes validation nearly impossible. To address these challenges, organizations are adopting DataOps, a methodology that combines Agile, Lean, and DevOps principles and treats analytics pipelines as production systems. However, a standardized framework for implementing DataOps is lacking. This perspective proposes a qualitative design for a DataOps-aligned validation framework. It introduces a DataOps Controls Scorecard, derived from a multivocal literature review, which distills key concepts into twelve testable controls. These controls are then mapped to a modular, extensible CI/CD pipeline framework designed to govern a single source of truth (SOT) SQL repository. The framework consists of five stages: Lint, Optimize, Parse, Validate, and Observe, each containing specific, automated checks. A Requirements Traceability Matrix (RTM) demonstrates how each high-level control is enforced by concrete pipeline checks, ensuring qualitative completeness. This approach provides a structured mechanism for enhancing data quality, governance, and collaboration, allowing teams to scale analytics development with transparency and control.
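The staged pipeline and traceability check described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: only the five stage names come from the abstract. The `Check` structure, the three toy checks, and the control identifiers (`C-documentation`, `C-efficiency`, `C-syntax`, `C-lineage`) are hypothetical placeholders, since the abstract does not enumerate the twelve controls or the concrete checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Stage names are taken from the abstract; their order defines execution order.
STAGES = ("Lint", "Optimize", "Parse", "Validate", "Observe")


@dataclass(frozen=True)
class Check:
    """One automated pipeline check (hypothetical structure)."""
    name: str
    stage: str
    run: Callable[[str], bool]   # receives SQL text, returns True on pass
    controls: Tuple[str, ...]    # IDs of the controls this check enforces


def has_header_comment(sql: str) -> bool:
    # Toy lint rule: the file must start with a SQL line comment.
    return sql.lstrip().startswith("--")


def no_select_star(sql: str) -> bool:
    # Toy optimization rule: reject "SELECT *" (whitespace-normalized).
    return "select *" not in " ".join(sql.lower().split())


def balanced_parens(sql: str) -> bool:
    # Toy parse rule: parentheses must be balanced.
    depth = 0
    for ch in sql:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Hypothetical checks and control IDs, for illustration only.
CHECKS: List[Check] = [
    Check("header_comment", "Lint", has_header_comment, ("C-documentation",)),
    Check("no_select_star", "Optimize", no_select_star, ("C-efficiency",)),
    Check("balanced_parens", "Parse", balanced_parens, ("C-syntax",)),
]


def run_pipeline(sql: str, checks: Sequence[Check] = CHECKS) -> List[str]:
    """Run checks stage by stage; return failed checks as 'Stage:name'."""
    failed = []
    for stage in STAGES:
        for check in checks:
            if check.stage == stage and not check.run(sql):
                failed.append(f"{stage}:{check.name}")
    return failed


def uncovered_controls(controls: Sequence[str],
                       checks: Sequence[Check] = CHECKS) -> List[str]:
    """RTM-style completeness test: controls that no check enforces."""
    covered = {c for check in checks for c in check.controls}
    return sorted(set(controls) - covered)
```

Calling `run_pipeline` on a SQL file returns the failed checks in stage order, and `uncovered_controls` flags any control that no check traces back to, which is the kind of gap a traceability matrix is meant to expose.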
Related papers
- LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks [4.6880826836662814]
We introduce LOGIGEN, a logic-driven framework that synthesizes verifiable training data. On $2$-Bench, LOGIGEN-32B(RL) achieves a 79.5% success rate, substantially outperforming the base model.
arXiv Detail & Related papers (2026-02-28T08:35:30Z)
- Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis [23.501567675008264]
We present BAR- Mutation (Boundary-Aware Reliable NL2), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We employ Knowledge-Grounded Reasoning Synthesis to ensure interpretability.
arXiv Detail & Related papers (2026-01-15T11:55:01Z)
- FABRIC: Framework for Agent-Based Realistic Intelligence Creation [3.940391073007047]
Large language models (LLMs) are increasingly deployed as agents, expected to decompose goals, invoke tools, and verify results in dynamic environments. We present a unified framework for synthesizing agentic data using only LLMs, without any human-in-the-loop supervision.
arXiv Detail & Related papers (2025-10-20T18:20:22Z)
- Analyzing and Internalizing Complex Policy Documents for LLM Agents [53.14898416858099]
Large Language Model (LLM)-based agentic systems rely on in-context policy documents encoding diverse business rules. This motivates developing internalization methods that embed policy documents into model priors while preserving performance. We introduce CC-Gen, an agentic benchmark generator with Controllable Complexity across four levels.
arXiv Detail & Related papers (2025-10-13T16:30:07Z)
- CoDA: Agentic Systems for Collaborative Data Visualization [57.270599188947294]
Deep research has revolutionized data analysis, yet data scientists still devote substantial time to manually crafting visualizations. Existing approaches, including simple single- or multi-agent systems, often oversimplify the task. We introduce CoDA, a multi-agent system that employs specialized LLM agents for metadata analysis, task planning, code generation, and self-reflection.
arXiv Detail & Related papers (2025-10-03T17:30:16Z)
- Learning to Route: A Rule-Driven Agent Framework for Hybrid-Source Retrieval-Augmented Generation [55.47971671635531]
Large Language Models (LLMs) have shown remarkable performance on general Question Answering (QA). Retrieval-Augmented Generation (RAG) addresses this limitation by enriching LLMs with external knowledge. Existing systems primarily rely on unstructured documents, while largely overlooking relational databases.
arXiv Detail & Related papers (2025-09-30T22:19:44Z)
- Query as Test: An Intelligent Driving Test and Data Storage Method for Integrated Cockpit-Vehicle-Road Scenarios [17.75264660582999]
Existing testing methods rely on data stacking, fail to cover all edge cases, and lack flexibility. "Query as Test" (QaT) shifts the focus from rigid, prescripted test cases to flexible, on-demand logical queries. "Extensible Scenarios Notations" (ESN) is a novel declarative data framework.
arXiv Detail & Related papers (2025-06-27T09:59:58Z)
- TD-Suite: All Batteries Included Framework for Technical Debt Classification [5.669063174637433]
TD-Suite provides a seamless end-to-end pipeline, managing everything from initial data ingestion to model training. To ensure the generated models are robust and perform reliably on real-world, often imbalanced, datasets, TD-Suite incorporates critical training methodologies. The framework integrates tracking and reporting of carbon emissions associated with the computationally intensive model training process.
arXiv Detail & Related papers (2025-04-15T11:31:17Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Relational Action Bases: Formalization, Effective Safety Verification, and Invariants (Extended Version) [67.99023219822564]
We introduce the general framework of relational action bases (RABs).
RABs generalize existing models by lifting both restrictions.
We demonstrate the effectiveness of this approach on a benchmark of data-aware business processes.
arXiv Detail & Related papers (2022-08-12T17:03:50Z)
- Soundness of Data-Aware Processes with Arithmetic Conditions [8.914271888521652]
Data Petri nets (DPNs) have gained increasing popularity thanks to their ability to balance simplicity with expressiveness.
The interplay of data and control-flow makes checking the correctness of such models, specifically the well-known property of soundness, crucial and challenging.
We provide a framework for assessing soundness of DPNs enriched with arithmetic data conditions.
arXiv Detail & Related papers (2022-03-28T14:46:10Z)
- CoCoMoT: Conformance Checking of Multi-Perspective Processes via SMT (Extended Version) [62.96267257163426]
We introduce the CoCoMoT (Computing Conformance Modulo Theories) framework.
First, we show how SAT-based encodings studied in the pure control-flow setting can be lifted to our data-aware case.
Second, we introduce a novel preprocessing technique based on a notion of property-preserving clustering.
arXiv Detail & Related papers (2021-03-18T20:22:50Z)