SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins
- URL: http://arxiv.org/abs/2408.11987v1
- Date: Wed, 21 Aug 2024 20:52:32 GMT
- Title: SimBench: A Rule-Based Multi-Turn Interaction Benchmark for Evaluating an LLM's Ability to Generate Digital Twins
- Authors: Jingquan Wang, Harry Zhang, Huzaifa Mustafa Unjhawala, Peter Negrut, Shu Wang, Khailanii Slaton, Radu Serban, Jin-Long Wu, Dan Negrut
- Abstract summary: We introduce SimBench, a benchmark designed to evaluate the proficiency of student large language models (S-LLMs) in generating digital twins (DTs).
Given a collection of S-LLMs, this benchmark enables the ranking of the S-LLMs based on their ability to produce high-quality DTs.
- Score: 8.244444633880603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce SimBench, a benchmark designed to evaluate the proficiency of student large language models (S-LLMs) in generating digital twins (DTs) that can be used in simulators for virtual testing. Given a collection of S-LLMs, this benchmark enables the ranking of the S-LLMs based on their ability to produce high-quality DTs. We demonstrate this by comparing over 20 open- and closed-source S-LLMs. Using multi-turn interactions, SimBench employs a rule-based judge LLM (J-LLM) that leverages both predefined rules and human-in-the-loop guidance to assign scores to the DTs generated by the S-LLM, thus providing a consistent and expert-inspired evaluation protocol. The J-LLM is specific to a simulator; here, the proposed benchmarking approach is demonstrated with the Chrono multi-physics simulator, which serves as the backdrop for assessing an S-LLM's ability to create digital twins for multibody dynamics, finite element analysis, vehicle dynamics, robotic dynamics, and sensor simulation. The benchmarking principle is broadly applicable and enables the assessment of an S-LLM's ability to generate digital twins for other simulation packages. All code and data are available at https://github.com/uwsbel/SimBench.
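To make the evaluation protocol concrete, below is a minimal Python sketch of the multi-turn S-LLM/J-LLM loop described in the abstract. It is a hypothetical reading of the protocol, not the repository's actual API: the s_llm, j_llm, and run_sim callables and the list-of-rules format are placeholder assumptions.

```python
from typing import Callable, Optional

# Hypothetical sketch of the SimBench loop: an S-LLM drafts a digital-twin
# script, a simulator (e.g. Chrono) runs it, and a rule-based J-LLM scores
# the result, optionally feeding criticism back for another turn.
def evaluate_s_llm(
    tasks: list[str],
    rules: list[str],
    s_llm: Callable[[str, Optional[str]], str],    # (task, feedback) -> DT script
    run_sim: Callable[[str], str],                 # script -> simulation log
    j_llm: Callable[[str, str, list[str]], tuple[float, Optional[str]]],
    max_turns: int = 3,
) -> float:
    """Return the mean J-LLM score, used to rank this S-LLM against others."""
    scores = []
    for task in tasks:
        score, feedback = 0.0, None
        for _ in range(max_turns):                 # multi-turn interaction
            script = s_llm(task, feedback)         # student drafts/repairs the DT
            log = run_sim(script)                  # execute under the simulator
            score, feedback = j_llm(script, log, rules)  # rule-based judging
            if feedback is None:                   # judge satisfied: stop early
                break
        scores.append(score)
    return sum(scores) / len(scores)
```

Under these assumptions, ranking a collection of S-LLMs reduces to sorting them by the returned mean score on a shared task set.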
Related papers
- Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation [48.17611255751571]
Post-training is essential for enabling large language models to follow human instructions.
We leverage multi-agent simulation to automatically generate diverse text-based scenarios.
We introduce a novel scenario-driven instruction generator MATRIX-Gen for controllable and highly realistic data synthesis.
arXiv Detail & Related papers (2024-10-18T08:01:39Z)
- LLM experiments with simulation: Large Language Model Multi-Agent System for Simulation Model Parametrization in Digital Twins [4.773175285216063]
This paper presents a novel framework that applies large language models (LLMs) to automate the parametrization of simulation models in digital twins.
The proposed approach enhances the usability of simulation models by infusing them with knowledge from LLMs.
The system has the potential to increase user-friendliness and reduce the cognitive load on human users; a minimal sketch of the idea appears after this entry.
arXiv Detail & Related papers (2024-05-28T11:59:40Z)
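As a rough illustration of the parametrization idea in the entry above: the paper describes a multi-agent framework, which this single-call Python sketch does not reproduce; the llm callable, the prompt format, and the JSON reply convention are all assumptions made for illustration.

```python
import json
from typing import Callable

def parametrize_model(
    llm: Callable[[str], str],   # any text-in/text-out LLM client
    model_doc: str,              # description of the simulation model
    user_goal: str,              # what the user wants the digital twin to do
) -> dict[str, float]:
    """Ask an LLM for simulation parameters and validate its reply."""
    prompt = (
        "You configure parameters for a digital-twin simulation model.\n"
        f"Model description:\n{model_doc}\n"
        f"User goal: {user_goal}\n"
        'Reply with a JSON object only, e.g. {"time_step": 0.01, "friction": 0.5}.'
    )
    params = json.loads(llm(prompt))  # fail loudly on malformed output
    if not all(isinstance(v, (int, float)) for v in params.values()):
        raise ValueError("LLM returned a non-numeric parameter value")
    return {k: float(v) for k, v in params.items()}
```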
- Investigating the Robustness of Counterfactual Learning to Rank Models: A Reproducibility Study [61.64685376882383]
Counterfactual learning to rank (CLTR) has attracted extensive attention in the IR community for its ability to leverage massive logged user interaction data to train ranking models.
This paper investigates the robustness of existing CLTR models in complex and diverse situations.
We find that the DLA models and IPS-DCM show better robustness under various simulation settings than IPS-PBM and PRS with offline propensity estimation.
arXiv Detail & Related papers (2024-04-04T10:54:38Z)
- How Reliable is Your Simulator? Analysis on the Limitations of Current LLM-based User Simulators for Conversational Recommendation [14.646529557978512]
We analyze the limitations of using Large Language Models in constructing user simulators for Conversational Recommender Systems.
Data leakage, which occurs in conversational history and the user simulator's replies, results in inflated evaluation results.
We propose SimpleUserSim, employing a straightforward strategy to guide the topic toward the target items.
arXiv Detail & Related papers (2024-03-25T04:21:06Z)
- A Multi-Grained Symmetric Differential Equation Model for Learning Protein-Ligand Binding Dynamics [74.93549765488103]
In drug discovery, molecular dynamics simulation provides a powerful tool for predicting binding affinities, estimating transport properties, and exploring pocket sites.
We propose NeuralMD, the first machine learning surrogate that can facilitate numerical MD and provide accurate simulations in protein-ligand binding.
We show the efficiency and effectiveness of NeuralMD, with a 2000$\times$ speedup over standard numerical MD simulation, outperforming all other ML approaches by up to 80% under the stability metric; a sketch of the surrogate idea appears after this entry.
arXiv Detail & Related papers (2024-01-26T09:35:17Z)
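The sketch below illustrates only the generic surrogate idea from the NeuralMD entry above, i.e., substituting a learned force predictor for the numerical force evaluation inside an integrator; it does not reproduce NeuralMD's actual architecture, and force_model is a hypothetical callable.

```python
import numpy as np
from typing import Callable

def surrogate_rollout(
    positions: np.ndarray,                            # (n_atoms, 3) coordinates
    velocities: np.ndarray,                           # (n_atoms, 3) velocities
    force_model: Callable[[np.ndarray], np.ndarray],  # learned force predictor
    dt: float,
    n_steps: int,
) -> np.ndarray:
    """Velocity-Verlet-style rollout with learned forces (unit masses assumed)."""
    traj = [positions.copy()]
    forces = force_model(positions)
    for _ in range(n_steps):
        positions = positions + dt * velocities + 0.5 * dt**2 * forces
        new_forces = force_model(positions)           # ML call replaces the
        velocities = velocities + 0.5 * dt * (forces + new_forces)  # numerical solver
        forces = new_forces
        traj.append(positions.copy())
    return np.stack(traj)                             # (n_steps + 1, n_atoms, 3)
```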
- LibSignal: An Open Library for Traffic Signal Control [8.290016666341755]
This paper introduces a library for cross-simulator comparison of reinforcement learning models in traffic signal control tasks.
It supports commonly used simulators in traffic signal control tasks, including Simulation of Urban MObility (SUMO) and CityFlow.
This is the first time that these methods have been compared fairly under the same datasets with different simulators.
arXiv Detail & Related papers (2022-11-19T10:21:50Z)
- Forces are not Enough: Benchmark and Critical Evaluation for Machine Learning Force Fields with Molecular Simulations [5.138982355658199]
Molecular dynamics (MD) simulation techniques are widely used for various natural science applications.
We benchmark a collection of state-of-the-art (SOTA) ML FF models and illustrate, in particular, how the commonly benchmarked force accuracy is not well aligned with relevant simulation metrics.
arXiv Detail & Related papers (2022-10-13T17:59:03Z)
- Metaphorical User Simulators for Evaluating Task-oriented Dialogue Systems [80.77917437785773]
Task-oriented dialogue systems (TDSs) are assessed mainly in an offline setting or through human evaluation.
We propose a metaphorical user simulator for end-to-end TDS evaluation, where we define a simulator to be metaphorical if it simulates a user's analogical thinking in interactions with systems.
We also propose a tester-based evaluation framework to generate variants, i.e., dialogue systems with different capabilities.
arXiv Detail & Related papers (2022-04-02T05:11:03Z)
- Using Machine Learning at Scale in HPC Simulations with SmartSim: An Application to Ocean Climate Modeling [52.77024349608834]
We demonstrate the first climate-scale, numerical ocean simulations improved through distributed, online inference of Deep Neural Networks (DNN) using SmartSim.
SmartSim is a library dedicated to enabling online analysis and Machine Learning (ML) for traditional HPC simulations.
arXiv Detail & Related papers (2021-04-13T19:27:28Z)
- A User's Guide to Calibrating Robotics Simulators [54.85241102329546]
This paper proposes a set of benchmarks and a framework for studying various algorithms aimed at transferring models and policies learned in simulation to the real world.
We conduct experiments on a wide range of well-known simulated environments to characterize and offer insights into the performance of different algorithms.
Our analysis can be useful for practitioners working in this area and can help make informed choices about the behavior and main properties of sim-to-real algorithms.
arXiv Detail & Related papers (2020-11-17T22:24:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.