Navigating the challenges in creating complex data systems: a
development philosophy
- URL: http://arxiv.org/abs/2210.13191v1
- Date: Fri, 21 Oct 2022 14:28:53 GMT
- Title: Navigating the challenges in creating complex data systems: a
development philosophy
- Authors: Sören Dittmer, Michael Roberts, Julian Gilbey, Ander Biguri,
AIX-COVNET Collaboration, Jacobus Preller, James H.F. Rudd, John A.D. Aston,
Carola-Bibiane Schönlieb
- Abstract summary: Perverse incentives and a lack of widespread software engineering skills are among many root causes.
We advocate two key development philosophies, namely that one should incrementally grow -- not biphasically plan and build -- DSSs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this perspective, we argue that despite the democratization of powerful
tools for data science and machine learning over the last decade, developing
the code for a trustworthy and effective data science system (DSS) is getting
harder. Perverse incentives and a lack of widespread software engineering (SE)
skills are among many root causes we identify that naturally give rise to the
current systemic crisis in reproducibility of DSSs. We analyze why SE and
building large complex systems is, in general, hard. Based on these insights,
we identify how SE addresses those difficulties and how we can apply and
generalize SE methods to construct DSSs that are fit for purpose. We advocate
two key development philosophies, namely that one should incrementally grow --
not biphasically plan and build -- DSSs, and one should always employ two types
of feedback loops during development: one which tests the code's correctness
and another that evaluates the code's efficacy.
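The two feedback loops named in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical illustration, not code from the paper: the helper names (`normalize`, `predict`, `check_correctness`, `evaluate_efficacy`), the toy data, and the choice of accuracy as the efficacy measure are all assumptions. The first loop asserts that the code does what it was written to do; the second measures how well the resulting system performs on labeled data.

```python
def normalize(values):
    """Scale values linearly to the range [0, 1] (hypothetical pipeline step)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: map everything to 0.0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def predict(score, threshold=0.5):
    """Toy classifier: label 1 if the normalized score reaches the threshold."""
    return 1 if score >= threshold else 0


def check_correctness():
    """Feedback loop 1: tests that the code behaves as specified."""
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]
    assert normalize([3, 3]) == [0.0, 0.0]
    assert predict(0.5) == 1 and predict(0.49) == 0
    return True


def evaluate_efficacy(raw_values, labels):
    """Feedback loop 2: accuracy of the whole pipeline on labeled data."""
    preds = [predict(s) for s in normalize(raw_values)]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


if __name__ == "__main__":
    check_correctness()
    print(evaluate_efficacy([1, 2, 8, 9], [0, 0, 1, 1]))
```

Under this reading, both loops would be rerun after every small increment: a failing assertion signals a coding error, while a drop in the efficacy score signals that correct code no longer serves its purpose.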
Related papers
- SE Research is a Complex Ecosystem: Isolated Fixes Keep Failing -- and Systems Thinking Shows Why [7.917868855980384]
The software engineering research community is productive, yet it faces a constellation of challenges. These issues arise from deep structural dynamics within the research ecosystem itself. We sketch such a framework drawing on ideas from complex systems, ecosystems, and theory of change.
arXiv Detail & Related papers (2026-01-22T23:32:06Z) - Let the Barbarians In: How AI Can Accelerate Systems Performance Research [80.43506848683633]
We term this iterative cycle of generation, evaluation, and refinement AI-Driven Research for Systems. We demonstrate that ADRS-generated solutions can match or even outperform human state-of-the-art designs.
arXiv Detail & Related papers (2025-12-16T18:51:23Z) - Barbarians at the Gate: How AI is Upending Systems Research [58.95406995634148]
We argue that systems research, long focused on designing and evaluating new performance-oriented algorithms, is particularly well-suited for AI-driven solution discovery. We term this approach as AI-Driven Research for Systems (ADRS), which iteratively generates, evaluates, and refines solutions. Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.
arXiv Detail & Related papers (2025-10-07T17:49:24Z) - A Survey on Code Generation with LLM-based Agents [61.474191493322415]
Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. LLMs are characterized by three core features. This paper presents a systematic survey of the field of LLM-based code generation agents.
arXiv Detail & Related papers (2025-07-31T18:17:36Z) - Identification and Optimization of Redundant Code Using Large Language Models [0.0]
Redundant code is a persistent challenge in software development that makes systems harder to maintain, scale, and update. This research aims to identify recurring patterns of redundancy and analyze their underlying causes, such as outdated practices or insufficient awareness of best coding principles.
arXiv Detail & Related papers (2025-05-07T00:44:32Z) - SYMBIOSIS: Systems Thinking and Machine Intelligence for Better Outcomes in Society [0.0]
SYMBIOSIS is an AI-powered framework and platform designed to make Systems Thinking accessible for addressing societal challenges.
To address this, we developed a generative co-pilot that translates complex systems representations into natural language.
SYMBIOSIS aims to serve as a foundational step to unlock future research into responsible and society-centered AI.
arXiv Detail & Related papers (2025-03-07T17:07:26Z) - From System 1 to System 2: A Survey of Reasoning Large Language Models [72.99519859756602]
Foundational Large Language Models excel at fast decision-making but lack depth for complex reasoning.
OpenAI's o1/o3 and DeepSeek's R1 have demonstrated expert-level performance in fields such as mathematics and coding.
arXiv Detail & Related papers (2025-02-24T18:50:52Z) - Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights [0.0]
This paper introduces a novel scoring mechanism called the SBC score.
It is based on a reverse generation technique that leverages the natural language generation capabilities of Large Language Models.
Unlike direct code analysis, our approach reconstructs system requirements from AI-generated code and compares them with the original specifications.
arXiv Detail & Related papers (2025-02-11T01:12:11Z) - Code-Survey: An LLM-Driven Methodology for Analyzing Large-Scale Codebases [3.8153349016958074]
We introduce Code-Survey, the first LLM-driven methodology designed to explore and analyze large-scale codebases.
By carefully designing surveys, Code-Survey transforms unstructured data, such as commits, emails, into organized, structured, and analyzable datasets.
This enables quantitative analysis of complex software evolution and uncovers valuable insights related to design, implementation, maintenance, reliability, and security.
arXiv Detail & Related papers (2024-09-24T17:08:29Z) - Making Software Development More Diverse and Inclusive: Key Themes, Challenges, and Future Directions [50.545824691484796]
We identify six themes around the challenges and opportunities to improve Software Developer Diversity and Inclusion (SDDI).
We identify benefits, harms, and future research directions for the four main themes.
We discuss the remaining two themes, Artificial Intelligence & SDDI and AI & Computer Science education, which have a cross-cutting effect on the other themes.
arXiv Detail & Related papers (2024-04-10T16:18:11Z) - DLAS: An Exploration and Assessment of the Deep Learning Acceleration
Stack [3.7873597471903935]
We combine machine learning and systems techniques within the Deep Learning Acceleration Stack (DLAS).
We evaluate the impact on accuracy and inference time when varying different parameters of DLAS across two datasets.
Overall we make 13 key observations, including that speedups provided by compression techniques are very hardware dependent.
arXiv Detail & Related papers (2023-11-15T12:26:31Z) - Rust for Embedded Systems: Current State, Challenges and Open Problems (Extended Report) [6.414678578343769]
This paper performs the first systematic study to holistically understand the current state and challenges of using RUST for embedded systems.
We collected a dataset of 2,836 RUST embedded software spanning various categories and 5 Static Application Security Testing (SAST) tools.
We found that existing RUST software support is inadequate, SAST tools cannot handle certain features of RUST embedded software, resulting in failures, and the prevalence of advanced types in existing RUST software makes it challenging to engineer interoperable code.
arXiv Detail & Related papers (2023-11-08T23:59:32Z) - When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose complexity-impacted reasoning score (CIRS) to measure correlation between code and reasoning abilities.
Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity.
Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
arXiv Detail & Related papers (2023-08-29T17:22:39Z) - Unpacking Privacy Labels: A Measurement and Developer Perspective on
Google's Data Safety Section [23.183167991569352]
We present a comprehensive analysis of Google's Data Safety Section (DSS) using both quantitative and qualitative methods.
We find that there are internal inconsistencies within the reported practices.
Next, we conduct a longitudinal study of DSS to explore how the reported practices evolve over time.
arXiv Detail & Related papers (2023-06-13T20:01:08Z) - DC-Check: A Data-Centric AI checklist to guide the development of
reliable machine learning systems [81.21462458089142]
Data-centric AI is emerging as a unifying paradigm that could enable reliable end-to-end pipelines.
We propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations.
This data-centric lens on development aims to promote thoughtfulness and transparency prior to system development.
arXiv Detail & Related papers (2022-11-09T17:32:09Z) - Explainable Intrusion Detection Systems (X-IDS): A Survey of Current
Methods, Challenges, and Opportunities [0.0]
Intrusion Detection Systems (IDS) have received widespread adoption due to their ability to handle vast amounts of data with a high prediction accuracy.
IDSs designed using Deep Learning (DL) techniques are often treated as black box models and do not provide a justification for their predictions.
This survey reviews the state-of-the-art in explainable AI (XAI) for IDS, its current challenges, and discusses how these challenges span to the design of an X-IDS.
arXiv Detail & Related papers (2022-07-13T14:31:46Z) - Learning Physical Concepts in Cyber-Physical Systems: A Case Study [72.74318982275052]
We provide an overview of the current state of research regarding methods for learning physical concepts in time series data.
We also analyze the most important methods from the current state of the art using the example of a three-tank system.
arXiv Detail & Related papers (2021-11-28T14:24:52Z) - Digital Twins: State of the Art Theory and Practice, Challenges, and
Open Research Questions [62.67593386796497]
This work explores the various DT features and current approaches, the shortcomings and reasons behind the delay in the implementation and adoption of digital twin.
The major reasons for this delay are the lack of a universal reference framework, domain dependence, security concerns of shared data, reliance of digital twin on other technologies, and lack of quantitative metrics.
arXiv Detail & Related papers (2020-11-02T19:08:49Z) - Data Mining with Big Data in Intrusion Detection Systems: A Systematic
Literature Review [68.15472610671748]
Cloud computing has become a powerful and indispensable technology for complex, high performance and scalable computation.
The rapid rate and volume of data creation has begun to pose significant challenges for data management and security.
The design and deployment of intrusion detection systems (IDS) in the big data setting has, therefore, become a topic of importance.
arXiv Detail & Related papers (2020-05-23T20:57:12Z) - Deep Learning for Person Re-identification: A Survey and Outlook [233.36948173686602]
Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras.
By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings.
arXiv Detail & Related papers (2020-01-13T12:49:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.