On the Evaluation of Engineering Artificial General Intelligence
- URL: http://arxiv.org/abs/2505.10653v1
- Date: Thu, 15 May 2025 18:52:47 GMT
- Title: On the Evaluation of Engineering Artificial General Intelligence
- Authors: Sandeep Neema, Susmit Jha, Adam Nagel, Ethan Lew, Chandrasekar Sureshkumar, Aleksa Gordic, Chase Shimmin, Hieu Nguyen, Paul Eremenko
- Abstract summary: We discuss the challenges and propose a framework for evaluating engineering artificial general intelligence (eAGI) agents. We consider eAGI as a specialization of artificial general intelligence (AGI). eAGI agents should possess a unique blend of background knowledge (recall and retrieve) of facts and methods.
- Score: 5.802869598386355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We discuss the challenges and propose a framework for evaluating engineering artificial general intelligence (eAGI) agents. We consider eAGI as a specialization of artificial general intelligence (AGI), deemed capable of addressing a broad range of problems in the engineering of physical systems and associated controllers. We exclude software engineering for a tractable scoping of eAGI and expect dedicated software engineering AI agents to address the software implementation challenges. Similar to human engineers, eAGI agents should possess a unique blend of background knowledge (recall and retrieve) of facts and methods, demonstrate familiarity with tools and processes, exhibit deep understanding of industrial components and well-known design families, and be able to engage in creative problem solving (analyze and synthesize), transferring ideas acquired in one context to another. Given this broad mandate, evaluating and qualifying the performance of eAGI agents is a challenge in itself and, arguably, a critical enabler to developing eAGI agents. In this paper, we address this challenge by proposing an extensible evaluation framework that specializes and grounds Bloom's taxonomy - a framework for evaluating human learning that has also been recently used for evaluating LLMs - in an engineering design context. Our proposed framework advances the state of the art in benchmarking and evaluation of AI agents in terms of the following: (a) developing a rich taxonomy of evaluation questions spanning from methodological knowledge to real-world design problems; (b) motivating a pluggable evaluation framework that can evaluate not only textual responses but also evaluate structured design artifacts such as CAD models and SysML models; and (c) outlining an automatable procedure to customize the evaluation benchmark to different engineering contexts.
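The abstract's point (b), a pluggable evaluation framework that can score both textual responses and structured design artifacts such as CAD and SysML models, can be pictured as a thin dispatch layer over artifact-specific evaluators. The sketch below is a hypothetical illustration only, not code from the paper: the names (BloomLevel, EvalItem, ArtifactEvaluator, TextEvaluator, run_benchmark) and the keyword-overlap scoring metric are assumptions made for the example.

```python
# Hypothetical sketch of a pluggable evaluation interface in the spirit of the
# framework described in the abstract; all names and metrics are illustrative,
# not taken from the paper.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum, auto


class BloomLevel(Enum):
    """Levels of Bloom's taxonomy used to tag evaluation questions."""
    REMEMBER = auto()
    UNDERSTAND = auto()
    APPLY = auto()
    ANALYZE = auto()
    EVALUATE = auto()
    CREATE = auto()


@dataclass
class EvalItem:
    """One benchmark item: a prompt, its taxonomy level, and the expected artifact type."""
    prompt: str
    level: BloomLevel
    artifact_type: str  # e.g. "text", "cad", "sysml"


class ArtifactEvaluator(ABC):
    """Pluggable evaluator for one artifact type (textual answer, CAD model, SysML model, ...)."""

    @abstractmethod
    def supports(self, artifact_type: str) -> bool: ...

    @abstractmethod
    def score(self, item: EvalItem, response: object) -> float:
        """Return a score in [0, 1] for the agent's response to this item."""


class TextEvaluator(ArtifactEvaluator):
    """Toy text evaluator: keyword overlap with a reference answer (placeholder metric)."""

    def __init__(self, reference: dict[str, str]):
        self.reference = reference  # maps prompt -> reference answer

    def supports(self, artifact_type: str) -> bool:
        return artifact_type == "text"

    def score(self, item: EvalItem, response: object) -> float:
        ref_tokens = set(self.reference.get(item.prompt, "").lower().split())
        resp_tokens = set(str(response).lower().split())
        return len(ref_tokens & resp_tokens) / max(len(ref_tokens), 1)


def run_benchmark(items: list[EvalItem], responses: dict[str, object],
                  evaluators: list[ArtifactEvaluator]) -> dict[BloomLevel, float]:
    """Dispatch each item to the first evaluator that supports its artifact type
    and report the mean score per taxonomy level."""
    totals: dict[BloomLevel, list[float]] = {}
    for item in items:
        evaluator = next(e for e in evaluators if e.supports(item.artifact_type))
        s = evaluator.score(item, responses.get(item.prompt))
        totals.setdefault(item.level, []).append(s)
    return {level: sum(v) / len(v) for level, v in totals.items()}
```

Under this reading, an eAGI benchmark customized to a given engineering context would register additional evaluators (for example, ones that check CAD geometry or SysML structure) behind the same interface, while the dispatch and per-level aggregation stay unchanged.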
Related papers
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- Rethinking Machine Unlearning in Image Generation Models [59.697750585491264]
CatIGMU is a novel hierarchical task categorization framework. EvalIGMU is a comprehensive evaluation framework. We construct DataIGM, a high-quality unlearning dataset.
arXiv Detail & Related papers (2025-06-03T11:25:14Z)
- Computational Safety for Generative AI: A Signal Processing Perspective [65.268245109828]
Computational safety is a mathematical framework that enables the quantitative assessment, formulation, and study of safety challenges in GenAI. We show how sensitivity analysis and loss landscape analysis can be used to detect malicious prompts with jailbreak attempts. We discuss key open research challenges, opportunities, and the essential role of signal processing in computational AI safety.
arXiv Detail & Related papers (2025-02-18T02:26:50Z)
- Work in Progress: AI-Powered Engineering-Bridging Theory and Practice [0.0]
This paper explores how generative AI can help automate and improve key steps in systems engineering. It examines AI's ability to analyze system requirements based on INCOSE's "good requirement" criteria. The research aims to assess AI's potential to streamline engineering processes and improve learning outcomes.
arXiv Detail & Related papers (2025-02-06T17:42:00Z)
- Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
- How critically can an AI think? A framework for evaluating the quality of thinking of generative artificial intelligence [0.9671462473115854]
Generative AI tools, such as those built on large language models, have created opportunities for innovative assessment design practices.
This paper presents a framework that explores the capabilities of the LLM ChatGPT4 application, which is the current industry benchmark.
The critique provides specific and targeted indications of the vulnerabilities of their questions with respect to critical thinking skills.
arXiv Detail & Related papers (2024-06-20T22:46:56Z)
- What Does Evaluation of Explainable Artificial Intelligence Actually Tell Us? A Case for Compositional and Contextual Validation of XAI Building Blocks [16.795332276080888]
We propose a fine-grained validation framework for explainable artificial intelligence systems.
We recognise their inherent modular structure: technical building blocks, user-facing explanatory artefacts and social communication protocols.
arXiv Detail & Related papers (2024-03-19T13:45:34Z)
- Levels of AGI for Operationalizing Progress on the Path to AGI [64.59151650272477]
We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors.
This framework introduces levels of AGI performance, generality, and autonomy, providing a common language to compare models, assess risks, and measure progress along the path to AGI.
arXiv Detail & Related papers (2023-11-04T17:44:58Z)
- Requirements Engineering Framework for Human-centered Artificial Intelligence Software Systems [9.642259026572175]
We present a new framework developed based on human-centered AI guidelines and a user survey to aid in collecting requirements for human-centered AI-based software.
The framework is applied to a case study to elicit and model requirements for enhancing the quality of 360-degree videos intended for virtual reality (VR) users.
arXiv Detail & Related papers (2023-03-06T06:37:50Z)
- An interdisciplinary conceptual study of Artificial Intelligence (AI) for helping benefit-risk assessment practices: Towards a comprehensive qualification matrix of AI programs and devices (pre-print 2020) [55.41644538483948]
This paper proposes a comprehensive analysis of existing concepts coming from different disciplines tackling the notion of intelligence.
The aim is to identify shared notions or discrepancies to consider for qualifying AI systems.
arXiv Detail & Related papers (2021-05-07T12:01:31Z)
- Synergizing Domain Expertise with Self-Awareness in Software Systems: A Patternized Architecture Guideline [11.155059219430207]
This paper highlights the importance of synergizing domain expertise and self-awareness to enable better self-adaptation in software systems.
We present a holistic framework of notions, enriched patterns, and methodology, dubbed DBASES, which offers a principled guideline for engineers.
arXiv Detail & Related papers (2020-01-20T12:17:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.