How Well Can AI Build SD Models?
- URL: http://arxiv.org/abs/2503.15580v1
- Date: Wed, 19 Mar 2025 14:48:47 GMT
- Title: How Well Can AI Build SD Models?
- Authors: William Schoenberg, Davidson Girard, Saras Chung, Ellen O'Neill, Janet Velasquez, Sara Metcalf
- Abstract summary: We introduce two metrics for the evaluation of AI-generated causal maps: technical correctness (causal translation) and adherence to instructions (conformance). We tested 11 different LLMs on their ability to do causal translation as well as conform to user instructions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Introduction: As system dynamics (SD) embraces automation, AI offers efficiency but risks bias from missing data and flawed models. Models that omit multiple perspectives and data threaten model quality, whether created by humans or with the assistance of AI. To reduce uncertainty about how well AI can build SD models, we introduce two metrics for the evaluation of AI-generated causal maps: technical correctness (causal translation) and adherence to instructions (conformance). Approach: We developed an open-source project called sd-ai to provide a basis for collaboration in the SD community, aiming to fully harness the potential of AI-based tools like ChatGPT for dynamic modeling. Additionally, we created an evaluation theory along with a comprehensive suite of tests designed to evaluate any such tools developed within the sd-ai ecosystem. Results: We tested 11 different LLMs on their ability to do causal translation as well as conform to user instructions. gpt-4.5-preview was the top performer, scoring 92.9% overall and excelling in both tasks. o1 scored 100% in causal translation. gpt-4o identified all causal links but struggled with positive polarity in decreasing terms. While gpt-4.5-preview and o1 are the most accurate, gpt-4o is the cheapest. Discussion: Causal translation and conformance tests applied to the sd-ai engine reveal significant variation across LLMs, underscoring the need for continued evaluation to ensure responsible development of AI tools for dynamic modeling. To address this, an open collaboration among tool developers, modelers, and stakeholders has been launched to standardize measures for evaluating the capacity of AI tools to improve the modeling process.
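The causal-translation metric lends itself to a simple mechanization. As a minimal sketch (not sd-ai's actual test suite; the triple representation and recall-style scoring rule here are assumptions), a causal map can be treated as a set of (source, target, polarity) links and scored against a ground-truth map:

```python
# Minimal sketch of a causal-translation check -- NOT the actual sd-ai
# test harness. Assumes a causal map is a set of (source, target,
# polarity) triples, with polarity "+" or "-".

def causal_translation_score(generated, ground_truth):
    """Fraction of ground-truth causal links recovered with correct polarity."""
    gen, ref = set(generated), set(ground_truth)
    return len(gen & ref) / len(ref) if ref else 1.0

truth = {("births", "population", "+"), ("deaths", "population", "-")}
llm_map = {("births", "population", "+"), ("deaths", "population", "+")}

# The second link has the wrong polarity, so the score is 0.5 -- the kind
# of polarity error on decreasing terms the abstract attributes to gpt-4o.
print(causal_translation_score(llm_map, truth))  # 0.5
```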
Related papers
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems.
We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure.
Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- Linear Discriminant Analysis in Credit Scoring: A Transparent Hybrid Model Approach [9.88281854509076]
We implement Linear Discriminant Analysis (LDA) as a feature reduction technique, which reduces the burden of the model's complexity.
Our hybrid model, XG-DNN, outperformed other models with the highest accuracy of 99.45% and a 99% F1 score with LDA.
To interpret model decisions, we applied two explainable AI techniques: LIME (local) and Morris Sensitivity Analysis (global).
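As a rough illustration of the LDA step, here is scikit-learn's implementation on synthetic stand-in data (the paper's actual credit features and hybrid pipeline are not reproduced):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for credit-scoring features, not the paper's data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # 500 applicants, 20 raw features
y = rng.integers(0, 2, size=500)    # binary default / no-default label

# With a binary target, LDA projects onto at most one discriminant axis,
# so the 20 features collapse to a single column that would feed the
# downstream XG-DNN hybrid.
lda = LinearDiscriminantAnalysis(n_components=1)
X_reduced = lda.fit_transform(X, y)
print(X_reduced.shape)  # (500, 1)
```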
arXiv Detail & Related papers (2024-12-05T14:21:18Z)
- Fine-tuning Large Language Models for Automated Diagnostic Screening Summaries [0.024105148723769353]
We evaluate several state-of-the-art Large Language Models (LLMs) for generating concise summaries from mental state examinations.
We rigorously evaluate four different models for summary generation using established ROUGE metrics and input from human evaluators.
Our top-performing fine-tuned model outperforms existing models, achieving ROUGE-1 and ROUGE-L values of 0.810 and 0.764, respectively.
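Those ROUGE figures can be reproduced with Google's rouge-score package; a minimal sketch with placeholder text standing in for the (non-shareable) clinical summaries:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

reference = "patient reports low mood and poor sleep over two weeks"
generated = "patient reports low mood and disturbed sleep for two weeks"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)  # signature: score(target, prediction)

# F-measures on the same 0-1 scale as the paper's 0.810 / 0.764 results.
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)
```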
arXiv Detail & Related papers (2024-03-29T12:25:37Z)
- Cloud-based XAI Services for Assessing Open Repository Models Under Adversarial Attacks [7.500941533148728]
We propose a cloud-based service framework that encapsulates computing components and assessment tasks into pipelines.
We demonstrate the application of XAI services for assessing five quality attributes of AI models.
arXiv Detail & Related papers (2024-01-22T00:37:01Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- A comprehensible analysis of the efficacy of Ensemble Models for Bug Prediction [0.0]
We present a comparison and analysis of the efficacy of two AI-based approaches, namely single AI models and ensemble AI models, for predicting the probability of a Java class being buggy.
Our experimental findings indicate that the ensemble of AI models can outperform the results of applying individual AI models.
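The single-model-versus-ensemble comparison is easy to set up generically; a scikit-learn sketch with synthetic data (the paper's concrete learners and Java bug dataset are not specified here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for per-class code metrics (buggy / not buggy).
X, y = make_classification(n_samples=600, n_features=15, random_state=0)

single = DecisionTreeClassifier(random_state=0)
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(random_state=0)),
    ],
    voting="soft",  # average the members' predicted probabilities
)

for name, model in [("single", single), ("ensemble", ensemble)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```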
arXiv Detail & Related papers (2023-10-18T17:43:54Z)
- ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [55.485985317538194]
ProcTHOR is a framework for procedural generation of Embodied AI environments.
We demonstrate state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
arXiv Detail & Related papers (2022-06-14T17:09:35Z)
- Data-Driven and SE-assisted AI Model Signal-Awareness Enhancement and Introspection [61.571331422347875]
We propose a data-driven approach to enhance models' signal-awareness.
We combine the SE concept of code complexity with the AI technique of curriculum learning.
We achieve up to 4.8x improvement in model signal awareness.
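The combination is essentially an ordering policy: rank training programs by a code-complexity measure and present them easiest-first. A schematic sketch, where the branch-keyword count is only a stand-in for a real metric such as cyclomatic complexity:

```python
# Schematic curriculum over code samples, ordered by a complexity proxy.
# The proxy and the fixed staging below are assumptions, not the
# paper's exact recipe.

BRANCH_KEYWORDS = ("if", "for", "while", "case", "&&", "||", "?")

def complexity(code: str) -> int:
    return sum(code.count(kw) for kw in BRANCH_KEYWORDS)

def curriculum_stages(samples, n_stages=3):
    """Split samples into stages of increasing complexity."""
    ordered = sorted(samples, key=complexity)
    size = max(1, len(ordered) // n_stages)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

samples = ["return x;",
           "if (x) { return y; }",
           "for (;;) { if (a && b) break; }"]
for stage, batch in enumerate(curriculum_stages(samples)):
    print(stage, batch)  # train on each stage in order, simplest first
```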
arXiv Detail & Related papers (2021-11-10T17:58:18Z)
- Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the scores produced by the models.
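That overstability probe can be sketched generically: swap a fraction of an essay's sentences for off-topic filler and check whether the model's score moves (score_essay below is a hypothetical stand-in for an AES model):

```python
import random

def perturb_essay(essay: str, fillers: list, fraction: float = 0.25,
                  seed: int = 0) -> str:
    """Replace roughly `fraction` of sentences with off-topic filler."""
    rng = random.Random(seed)
    sentences = [s for s in essay.split(". ") if s]
    k = max(1, int(len(sentences) * fraction))
    for i in rng.sample(range(len(sentences)), k):
        sentences[i] = rng.choice(fillers)
    return ". ".join(sentences)

essay = ("The industrial revolution reshaped cities. Labor moved to "
         "factories. Railways linked markets. Living standards rose")
off_topic = ["Penguins huddle for warmth", "Basalt is an igneous rock"]

perturbed = perturb_essay(essay, off_topic)
print(perturbed)
# An overstable model would score both versions nearly identically:
# print(score_essay(essay), score_essay(perturbed))  # hypothetical scorer
```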
arXiv Detail & Related papers (2020-07-14T03:49:43Z)
- Model-based actor-critic: GAN (model generator) + DRL (actor-critic) => AGI [0.0]
We propose adding a (generative/predictive) environment model to the actor-critic (model-free) architecture.
The proposed AI model is otherwise similar to (model-free) DDPG and is therefore called model-based DDPG.
Our initial, limited experiments show that combining DRL and GAN in a model-based actor-critic yields the incremental, goal-driven intelligence required to solve each task, with performance similar to (model-free) DDPG.
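Structurally, the proposal adds a third learned network to DDPG's actor/critic pair. A PyTorch skeleton of the data flow (layer sizes and the joint state-plus-reward head are assumptions; training loops are omitted):

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 2

# Standard DDPG pair: deterministic actor and Q-value critic.
actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                      nn.Linear(64, ACTION_DIM), nn.Tanh())
critic = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                       nn.Linear(64, 1))

# The added generative/predictive environment model: maps (s, a) to a
# predicted next state plus a scalar reward.
env_model = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                          nn.Linear(64, STATE_DIM + 1))

s = torch.randn(1, STATE_DIM)
a = actor(s)
sa = torch.cat([s, a], dim=-1)

pred = env_model(sa)
s_next, reward = pred[:, :STATE_DIM], pred[:, STATE_DIM:]

# Imagined transitions (s, a, reward, s_next) can supplement real replay
# data when updating the critic, as in model-based rollouts.
q_value = critic(sa)
print(s_next.shape, reward.shape, q_value.shape)
```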
arXiv Detail & Related papers (2020-04-04T02:05:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.