Related papers: BLADE: Benchmarking Language Model Agents for Data-Driven Science

URL: http://arxiv.org/abs/2408.09667v2
Date: Tue, 20 Aug 2024 20:25:10 GMT
Title: BLADE: Benchmarking Language Model Agents for Data-Driven Science
Authors: Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang, Tianmai M. Zhang, Lanyi Zhu, Mike A. Merrill, Jeffrey Heer, Tim Althoff
Abstract summary: LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. We present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions.
Score: 18.577658530714505
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However, evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature, with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses, we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents' analysis approaches.

Related papers

DSBC : Data Science task Benchmarking with Context engineering [0.0]
We introduce a benchmark to reflect real-world user interactions with data science agents. We evaluate three large language models (LLMs) across three approaches.
arXiv Detail & Related papers (2025-07-31T08:32:37Z)
Dynamic Knowledge Exchange and Dual-diversity Review: Concisely Unleashing the Potential of a Multi-Agent Research Team [53.38438460574943]
IDVSCI is a multi-agent framework built on large language models (LLMs). It incorporates two key innovations: a Dynamic Knowledge Exchange mechanism and a Dual-Diversity Review paradigm. Results show that IDVSCI consistently achieves the best performance across two datasets.
arXiv Detail & Related papers (2025-06-23T07:12:08Z)
Towards Artificial Intelligence Research Assistant for Expert-Involved Learning [64.7438151207189]
Large Language Models (LLMs) and Large Multi-Modal Models (LMMs) have emerged as transformative tools in scientific research. We present ARtificial Intelligence research assistant for Expert-involved Learning (ARIEL).
arXiv Detail & Related papers (2025-05-03T14:21:48Z)
Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents [11.74019905854637]
Large language models (LLMs) are evolving into scientific agents that automate critical tasks. Unlike general-purpose LLMs, specialized agents integrate domain-specific knowledge, advanced tool sets, and robust validation mechanisms. We highlight why they differ from general agents and the ways in which they advance research across various scientific fields.
arXiv Detail & Related papers (2025-03-31T13:11:28Z)
User-centric evaluation of explainability of AI with and for humans: a comprehensive empirical study [5.775094401949666]
This study is situated in the field of Human-Centered Artificial Intelligence (HCAI). It focuses on the results of a user-centered assessment of commonly used eXplainable Artificial Intelligence (XAI) algorithms.
arXiv Detail & Related papers (2024-10-21T12:32:39Z)
Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges. We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow. We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? [58.330879414174476]
We introduce DSBench, a benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG).
arXiv Detail & Related papers (2024-09-12T02:08:00Z)
Diagnostic Reasoning in Natural Language: Computational Model and Application [68.47402386668846]
We investigate diagnostic abductive reasoning (DAR) in the context of language-grounded tasks (NL-DAR). We propose a novel modeling framework for NL-DAR based on Pearl's structural causal models. We use the resulting dataset to investigate the human decision-making process in NL-DAR.
arXiv Detail & Related papers (2024-09-09T06:55:37Z)
DISCOVER: A Data-driven Interactive System for Comprehensive Observation, Visualization, and ExploRation of Human Behaviour [6.716560115378451]
We introduce a modular, flexible, yet user-friendly software framework specifically developed to streamline computational-driven data exploration for human behavior analysis. Our primary objective is to democratize access to advanced computational methodologies, thereby enabling researchers across disciplines to engage in detailed behavioral analysis without the need for extensive technical proficiency.
arXiv Detail & Related papers (2024-07-18T11:28:52Z)
MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work. ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them. We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z)
Benchmarking Data Science Agents [11.582116078653968]
Large Language Models (LLMs) have emerged as promising aids as data science agents, assisting humans in data analysis and processing. Yet their practical efficacy remains constrained by the varied demands of real-world applications and complicated analytical process. We introduce DSEval -- a novel evaluation paradigm, as well as a series of innovative benchmarks tailored for assessing the performance of these agents.
arXiv Detail & Related papers (2024-02-27T03:03:06Z)
MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks. MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)
Can Large Language Models Serve as Data Analysts? A Multi-Agent Assisted Approach for Qualitative Data Analysis [6.592797748561459]
Large Language Models (LLMs) have enabled collaborative human-bot interactions in Software Engineering (SE). We introduce a new dimension of scalability and accuracy in qualitative research, potentially transforming data interpretation methodologies in SE.
arXiv Detail & Related papers (2024-02-02T13:10:46Z)
Machine-assisted quantitizing designs: augmenting humanities and social sciences with artificial intelligence [0.0]
Large language models (LLMs) have been shown to present an unprecedented opportunity to scale up data analytics in the humanities and social sciences. We build on mixed methods quantitizing and converting design principles, and feature analysis from linguistics, to transparently integrate human expertise and machine scalability. The approach is discussed and demonstrated in over a dozen LLM-assisted case studies, covering 9 diverse languages, multiple disciplines and tasks.
arXiv Detail & Related papers (2023-09-24T14:21:50Z)
Application of Transformers based methods in Electronic Medical Records: A Systematic Literature Review [77.34726150561087]
This work presents a systematic literature review of state-of-the-art advances using transformer-based methods on electronic medical records (EMRs) in different NLP tasks.
arXiv Detail & Related papers (2023-04-05T22:19:42Z)
Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task. The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them. By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.