InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation
- URL: http://arxiv.org/abs/2510.09724v1
- Date: Fri, 10 Oct 2025 07:55:46 GMT
- Title: InteractScience: Programmatic and Visually-Grounded Evaluation of Interactive Scientific Demonstration Code Generation
- Authors: Qiaosheng Chen, Yang Liu, Lei Li, Kai Chen, Qipeng Guo, Gong Cheng, Fei Yuan,
- Abstract summary: Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions. Generating scientific demonstrations requires models to combine accurate scientific knowledge with the ability to implement interactive front-end code. We present InteractScience, a benchmark consisting of a substantial set of carefully designed questions across five scientific domains.
- Score: 47.17929896747628
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly capable of generating complete applications from natural language instructions, creating new opportunities in science and education. In these domains, interactive scientific demonstrations are particularly valuable for explaining concepts, supporting new teaching methods, and presenting research findings. Generating such demonstrations requires models to combine accurate scientific knowledge with the ability to implement interactive front-end code that behaves correctly and responds to user actions. This capability goes beyond the scope of existing benchmarks, which typically evaluate either knowledge question answering without grounding in code or static web code generation without scientific interactivity. To evaluate this integrated ability, we design a hybrid framework that combines programmatic functional testing to rigorously verify interaction logic with visually-grounded qualitative testing to assess rendered outputs against reference snapshots. Building on this framework, we present InteractScience, a benchmark consisting of a substantial set of carefully designed questions across five scientific domains, each paired with unit tests, reference snapshots, and checklists. We evaluate 30 leading open- and closed-source LLMs and report results that highlight ongoing weaknesses in integrating domain knowledge with interactive front-end coding. Our work positions InteractScience as the first benchmark to automatically measure this combined capability with realistic interactive operations, providing a foundation for advancing reliable and educationally useful scientific demonstration code generation. All code and data are publicly available at https://github.com/open-compass/InteractScience.
Related papers
- El Agente Gráfico: Structured Execution Graphs for Scientific Agents [7.47895130442454]
We present El Agente Gráfico, a single-agent framework that embeds large language model (LLM)-driven decision-making within a type-safe execution environment. Central to our approach is a structured abstraction of scientific concepts and an object-graph mapper that represents computational state as typed Python objects. We evaluate the system by developing an automated benchmarking framework across a suite of university-level quantum chemistry tasks.
arXiv Detail & Related papers (2026-02-19T23:47:05Z)
- Accelerating Scientific Research with Gemini: Case Studies and Common Techniques [105.15622072347811]
Large language models (LLMs) have opened new avenues for accelerating scientific research. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models.
arXiv Detail & Related papers (2026-02-03T18:56:17Z)
- AInsteinBench: Benchmarking Coding Agents on Scientific Repositories [33.48206557020983]
AInsteinBench is a large-scale benchmark for evaluating whether large language model (LLM) agents can operate as scientific computing development agents. AInsteinBench measures a model's ability to move beyond surface-level code generation toward the core competencies required for computational scientific research.
arXiv Detail & Related papers (2025-12-24T08:11:11Z)
- FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI [24.545163508739943]
FreeAskWorld is an interactive simulation framework that integrates large language models for high-level behavior planning and semantically grounded interaction. Our framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline tailored for diverse embodied tasks. We present and publicly release FreeAskWorld, a large-scale benchmark dataset comprising reconstructed environments, six diverse task types, 16 core object categories, 63,429 annotated sample frames, and more than 17 hours of interaction data.
arXiv Detail & Related papers (2025-11-17T15:58:46Z)
- Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection [51.52749744031413]
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions. Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues. We propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics.
arXiv Detail & Related papers (2025-07-23T12:30:19Z)
- Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team [53.38438460574943]
IDVSCI is a multi-agent framework built on large language models (LLMs). It incorporates two key innovations: a Dynamic Knowledge Exchange mechanism and a Dual-Diversity Review paradigm. Results show that IDVSCI consistently achieves the best performance across two datasets.
arXiv Detail & Related papers (2025-06-23T07:12:08Z)
- ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows [82.07367406991678]
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing. Among these, computer-using agents are capable of interacting with operating systems as humans do. We introduce ScienceBoard, which encompasses a realistic, multi-domain environment featuring dynamic and visually rich scientific software.
arXiv Detail & Related papers (2025-05-26T12:27:27Z)
- Data Science Principles for Interpretable and Explainable AI [0.7581664835990121]
Interpretable and interactive machine learning aims to make complex models more transparent and controllable.
This review synthesizes key principles from the growing literature in this field.
arXiv Detail & Related papers (2024-05-17T05:32:27Z)
- The Future of Scientific Publishing: Automated Article Generation [0.0]
This study introduces a novel software tool leveraging large language model (LLM) prompts, designed to automate the generation of academic articles from Python code.
Python served as a foundational proof of concept; however, the underlying methodology and framework exhibit adaptability across various GitHub repos.
The development was achieved without reliance on advanced language model agents, ensuring high fidelity in the automated generation of coherent and comprehensive academic content.
arXiv Detail & Related papers (2024-04-11T16:47:02Z)
- Interactive Natural Language Processing [67.87925315773924]
Interactive Natural Language Processing (iNLP) has emerged as a novel paradigm within the field of NLP.
This paper offers a comprehensive survey of iNLP, starting by proposing a unified definition and framework of the concept.
arXiv Detail & Related papers (2023-05-22T17:18:29Z)
- Automated Creation and Human-assisted Curation of Computable Scientific Models from Code and Text [2.3746609573239756]
Domain experts cannot gain a complete understanding of the implementation of a scientific model if they are not familiar with the code.
We develop a system for the automated creation and human-assisted curation of scientific models.
We present experimental results obtained using a dataset of code and associated text derived from NASA's Hypersonic Aerodynamics website.
arXiv Detail & Related papers (2022-01-28T17:31:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.