SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems
- URL: http://arxiv.org/abs/2509.23130v2
- Date: Tue, 30 Sep 2025 07:31:57 GMT
- Title: SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems
- Authors: Qian Cheng, Ruize Tang, Emilie Ma, Finn Hackett, Peiyang He, Yiming Su, Ivan Beschastnikh, Yu Huang, Xiaoxing Ma, Tianyin Xu
- Abstract summary: We present SysMoBench, a benchmark that evaluates AI's ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today's critical computing infrastructures.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Formal models are essential to specifying large, complex computer systems and verifying their correctness, but they are notoriously expensive to write and maintain. Recent advances in generative AI show promise in generating certain forms of specifications. However, existing work mostly targets small code fragments rather than complete systems. It is unclear whether AI can deal with realistic system artifacts, as this requires abstracting their complex behavioral properties into formal models. We present SysMoBench, a benchmark that evaluates AI's ability to formally model large, complex systems. We focus on concurrent and distributed systems, which are keystones of today's critical computing infrastructures, encompassing operating systems and cloud infrastructure. We use TLA+, the de facto specification language for concurrent and distributed systems, though the benchmark can be extended to other specification languages. We address the primary challenge of evaluating AI-generated models by automating metrics like syntactic and runtime correctness, conformance to system code, and invariant correctness. SysMoBench currently includes nine diverse system artifacts, including the Raft implementations of Etcd and Redis and the spinlock and mutex of Asterinas OS; more artifacts are being actively added. SysMoBench enables us to understand the capabilities and limitations of today's LLMs and agents, putting tools in this area on a firm footing and opening up promising new research directions.
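To make the "invariant correctness" metric concrete: a formal model of, say, a spinlock defines a state space and transitions, and a model checker such as TLC exhaustively explores the reachable states to confirm that an invariant (e.g. mutual exclusion) holds in every one. The following Python sketch is a hypothetical, minimal illustration of that idea (it is not code from SysMoBench or TLC): it enumerates the states of a two-process test-and-set spinlock and checks that both processes are never in the critical section simultaneously.

```python
from collections import deque

# A state is (lock_held, pc0, pc1), where each pc is "idle" or "crit".
INIT = (False, "idle", "idle")

def next_states(state):
    """Successor states of a two-process test-and-set spinlock."""
    lock, pc = state[0], list(state[1:])
    succs = []
    for i in (0, 1):
        if pc[i] == "idle" and not lock:
            # Atomic test-and-set succeeds: enter the critical section.
            new_pc = pc.copy(); new_pc[i] = "crit"
            succs.append((True, new_pc[0], new_pc[1]))
        elif pc[i] == "crit":
            # Release the lock and return to idle.
            new_pc = pc.copy(); new_pc[i] = "idle"
            succs.append((False, new_pc[0], new_pc[1]))
    return succs

def check_mutual_exclusion():
    """BFS over the reachable state graph; the invariant is that the
    two processes are never both in the critical section."""
    seen, queue = {INIT}, deque([INIT])
    while queue:
        s = queue.popleft()
        if s[1] == "crit" and s[2] == "crit":
            return False  # invariant violated
        for t in next_states(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return True

print(check_mutual_exclusion())  # True: mutual exclusion holds
```

A TLA+ model plays the same role declaratively: `next_states` corresponds to the next-state relation, and the mutual-exclusion property to an invariant checked by TLC, which is what makes invariant correctness amenable to automated benchmark scoring.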
Related papers
- Agentic Artificial Intelligence (AI): Architectures, Taxonomies, and Evaluation of Large Language Model Agents [14.448267395835721]
We propose a unified taxonomy that breaks agents into Perception, Brain, Planning, Action, Tool Use, and Collaboration. We also group the environments in which these agents operate, including digital operating systems, embodied robotics, and other specialized domains.
arXiv Detail & Related papers (2026-01-18T19:51:16Z) - Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models [78.73992315826035]
We introduce Youtu-LLM, a lightweight language model that harmonizes high computational efficiency with native agentic intelligence. Youtu-LLM is pre-trained from scratch to systematically cultivate reasoning and planning capabilities.
arXiv Detail & Related papers (2025-12-31T04:25:11Z) - Validity Is What You Need [3.0111718611142684]
We consider other definitions of Agentic AI and propose a new realist definition. We note, however, that Agentic AI systems are primarily applications, not foundations.
arXiv Detail & Related papers (2025-10-31T17:00:04Z) - Dissect-and-Restore: AI-based Code Verification with Transient Refactoring [1.2883590530210827]
We present Prometheus, a novel AI-assisted system that facilitates automated code verification with current AI capabilities. Prometheus guides the proof search through structured decomposition of complex lemmas into smaller, verifiable sub-lemmas. This approach successfully verifies 86% of tasks in our curated dataset, compared to 68% for the baseline.
arXiv Detail & Related papers (2025-10-29T11:23:50Z) - Embodied AI: From LLMs to World Models [65.68972714346909]
Embodied Artificial Intelligence (AI) is an intelligent system paradigm for achieving Artificial General Intelligence (AGI). Recent breakthroughs in Large Language Models (LLMs) and World Models (WMs) have drawn significant attention for embodied AI.
arXiv Detail & Related papers (2025-09-24T11:37:48Z) - Is the `Agent' Paradigm a Limiting Framework for Next-Generation Intelligent Systems? [0.0]
The concept of the 'agent' has profoundly shaped Artificial Intelligence (AI) research. This paper critically re-evaluates the necessity and optimality of this agent-centric paradigm.
arXiv Detail & Related papers (2025-09-13T16:11:27Z) - Reasoning Language Models: A Blueprint [16.04440875855868]
Reasoning language models (RLMs) have redefined AI's problem-solving capabilities. Yet, their high costs, proprietary nature, and complex architectures present accessibility and scalability challenges. We propose a comprehensive blueprint that organizes RLM components into a modular framework.
arXiv Detail & Related papers (2025-01-20T02:16:19Z) - Fundamental Risks in the Current Deployment of General-Purpose AI Models: What Have We (Not) Learnt From Cybersecurity? [60.629883024152576]
Large Language Models (LLMs) have seen rapid deployment in a wide range of use cases. Systems such as OpenAI's Altera exemplify increased autonomy, data access, and execution capabilities. These methods come with a range of cybersecurity challenges.
arXiv Detail & Related papers (2024-12-19T14:44:41Z) - Large Action Models: From Inception to Implementation [51.81485642442344]
Large Action Models (LAMs) are designed for action generation and execution within dynamic environments. LAMs hold the potential to transform AI from passive language understanding to active task completion. We present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment.
arXiv Detail & Related papers (2024-12-13T11:19:56Z) - Specifications: The missing link to making the development of LLM systems an engineering discipline [65.10077876035417]
We discuss the progress the field has made so far through advances like structured outputs, process supervision, and test-time compute. We outline several future directions for research to enable the development of modular and reliable LLM-based systems.
arXiv Detail & Related papers (2024-11-25T07:48:31Z) - Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback [52.763620660061115]
ONI is a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function. We explore a range of algorithmic choices for reward modeling with varying complexity. Our approach achieves state-of-the-art performance across a range of challenging tasks from the NetHack Learning Environment.
arXiv Detail & Related papers (2024-10-30T13:52:43Z) - Towards Single-System Illusion in Software-Defined Vehicles -- Automated, AI-Powered Workflow [3.2821049498759094]
We propose a novel model- and feature-based approach to development of vehicle software systems.
One of the key points of the presented approach is the inclusion of modern generative AI, specifically Large Language Models (LLMs).
The resulting pipeline is automated to a large extent, with feedback being generated at each step.
arXiv Detail & Related papers (2024-03-21T15:07:57Z) - TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs [71.7495056818522]
We introduce TaskMatrix.AI as a new AI ecosystem that connects foundation models with millions of APIs for task completion.
We will present our vision of how to build such an ecosystem, explain each key component, and use study cases to illustrate both the feasibility of this vision and the main challenges we need to address next.
arXiv Detail & Related papers (2023-03-29T03:30:38Z) - Machine Learning Systems: A Survey from a Data-Oriented Perspective [6.933671804969495]
Data-oriented Architecture (DOA) is an emerging style that equips systems better for integrating ML models. This paper surveys why, how, and to what extent practitioners have adopted DOA to implement and deploy ML-based systems.
arXiv Detail & Related papers (2023-02-09T17:57:02Z) - MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning [50.40151403246205]
Huge language models (LMs) have ushered in a new era for AI, serving as a gateway to natural-language-based knowledge tasks.
We define a flexible architecture with multiple neural models, complemented by discrete knowledge and reasoning modules.
We describe this neuro-symbolic architecture, dubbed the Modular Reasoning, Knowledge and Language (MRKL) system.
arXiv Detail & Related papers (2022-05-01T11:01:28Z)