Related papers: Can AI Master Econometrics? Evidence from Econometrics AI Agent on Expert-Level Tasks

Related papers

How to Build AI Agents by Augmenting LLMs with Codified Human Expert Domain Knowledge? A Software Engineering Framework [3.0049184484925604]
Critical domain knowledge typically resides with few experts.<n>Non-experts struggle to create effective visualizations.<n>This paper investigates how to capture and embed human domain knowledge into AI agent systems.
arXiv Detail & Related papers (2026-01-21T16:23:22Z)
SelfAI: Building a Self-Training AI System with LLM Agents [79.10991818561907]
SelfAI is a general multi-agent platform that combines a User Agent for translating high-level research objectives into standardized experimental configurations.<n>An Experiment Manager orchestrates parallel, fault-tolerant training across heterogeneous hardware while maintaining a structured knowledge base for continuous feedback.<n>Across regression, computer vision, scientific computing, medical imaging, and drug discovery benchmarks, SelfAI consistently achieves strong performance and reduces redundant trials.
arXiv Detail & Related papers (2025-11-29T09:18:39Z)
Artificial Intelligence Agents in Music Analysis: An Integrative Perspective Based on Two Use Cases [0.0]
This paper presents an integrative review and experimental validation of artificial intelligence (AI) agents applied to music analysis and education.<n>We synthesize the historical evolution from rule-based models to contemporary approaches involving deep learning, multi-agent architectures, and retrieval-augmented generation frameworks.<n> Experimental results demonstrate that AI agents effectively enhance musical pattern recognition, compositional parameterization, and educational feedback.
arXiv Detail & Related papers (2025-11-17T23:46:47Z)
Creative Adversarial Testing (CAT): A Novel Framework for Evaluating Goal-Oriented Agentic AI Systems [0.0]
Creative Adversarial Testing (CAT) is a novel approach designed to capture and analyze the complex relationship between Agentic AI tasks and the system's intended objectives.<n>We validate the CAT framework through extensive simulation using synthetic interaction data modeled after Alexa+ audio services.<n>Our results demonstrate that the CAT framework provides unprecedented insights into goal-task alignment, enabling more effective optimization and development of Agentic AI systems.
arXiv Detail & Related papers (2025-09-26T23:52:20Z)
An Approach to Grounding AI Model Evaluations in Human-derived Criteria [0.0]
We propose a novel approach to augment existing benchmarks with human-derived evaluation criteria.<n>Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys.<n>Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance.
arXiv Detail & Related papers (2025-09-04T21:40:32Z)
Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training [67.895981259683]
General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence.<n>Current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools.<n>We present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework.
arXiv Detail & Related papers (2025-08-01T08:11:31Z)
AI, Humans, and Data Science: Optimizing Roles Across Workflows and the Workforce [0.0]
We consider the potential and limitation of analytic, generative, and agentic AI to augment data scientists or take on tasks traditionally done by human analysts and researchers.<n>Just as earlier eras of survey analysis created issues when the increased ease of using statistical software allowed researchers to conduct analyses they did not fully understand, the new AI tools may create similar but larger risks.
arXiv Detail & Related papers (2025-07-15T17:59:06Z)
AutoMind: Adaptive Knowledgeable Agent for Automated Data Science [70.33796196103499]
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems.<n>Existing frameworks depend on rigid, pre-defined and inflexible coding strategies.<n>We introduce AutoMind, an adaptive, knowledgeable LLM-agent framework.
arXiv Detail & Related papers (2025-06-12T17:59:32Z)
Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study [15.97770416681533]
Software engineering agents (SWE agents) operate autonomously by interpreting user input and responding to environmental feedback.<n>We present the first systematic study of SWE agent behavior through the lens of execution traces.
arXiv Detail & Related papers (2025-06-10T00:41:54Z)
The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority.<n>We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
TimeSeriesGym: A Scalable Benchmark for (Time Series) Machine Learning Engineering Agents [17.296425855109426]
We introduce TimeSeriesGym, a scalable benchmarking framework for evaluating Artificial Intelligence (AI) agents.<n>TimeSeriesGym incorporates challenges from diverse sources spanning multiple domains and tasks.<n>We implement evaluation mechanisms for multiple research artifacts, including submission files, code, and models.
arXiv Detail & Related papers (2025-05-19T16:11:23Z)
MLGym: A New Framework and Benchmark for Advancing AI Research Agents [51.9387884953294]
We introduce Meta MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing large language models on AI research tasks.<n>This is the first Gym environment for machine learning (ML) tasks, enabling research on reinforcement learning (RL) algorithms for training such agents.<n>We evaluate a number of frontier large language models (LLMs) on our benchmarks such as Claude-3.5-Sonnet, Llama-3.1 405B, GPT-4o, o1-preview, and Gemini-1.5 Pro.
arXiv Detail & Related papers (2025-02-20T12:28:23Z)
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.46737975742287]
We introduce TheAgentCompany, a benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker.<n>We find that the most competitive agent can complete 30% of tasks autonomously.<n>This paints a nuanced picture on task automation with simulating LM agents in a setting a real workplace.
arXiv Detail & Related papers (2024-12-18T18:55:40Z)
The BrowserGym Ecosystem for Web Agent Research [151.90034093362343]
BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents.<n>We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature.<n>We conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks.
arXiv Detail & Related papers (2024-12-06T23:43:59Z)
Follow the money: a startup-based measure of AI exposure across occupations, industries and regions [0.0]
Existing measures of AI occupational exposure focus on AI's theoretical potential to substitute or complement human labour on the basis of technical feasibility.<n>We introduce the AI Startup Exposure (AISE) index-a novel metric based on occupational descriptions from O*NET and AI applications developed by startups.<n>Our findings suggest that AI adoption will be gradual and shaped by social factors as much as by the technical feasibility of AI applications.
arXiv Detail & Related papers (2024-12-06T10:25:05Z)
ML Research Benchmark [0.0]
We present the ML Research Benchmark (MLRB), comprising 7 competition-level tasks derived from recent machine learning conference tracks. This paper introduces a novel benchmark and evaluates it using agent scaffolds powered by frontier models, including Claude-3 and GPT-4o. The results indicate that the Claude-3.5 Sonnet agent performs best across our benchmark, excelling in planning and developing machine learning models.
arXiv Detail & Related papers (2024-10-29T21:38:42Z)
Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges. We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow. We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z)
Can a GPT4-Powered AI Agent Be a Good Enough Performance Attribution Analyst? [0.0]
This study introduces the application of an AI Agent for a variety of essential performance attribution tasks. It achieves accuracy rates exceeding 93% in analyzing performance drivers, attains 100% in multi-level attribution calculations, and surpasses 84% accuracy in QA exercises that simulate official examination standards.
arXiv Detail & Related papers (2024-03-15T17:12:57Z)
PADTHAI-MM: Principles-based Approach for Designing Trustworthy, Human-centered AI using MAST Methodology [5.215782336985273]
The Multisource AI Scorecard Table (MAST) was designed to bridge the gap by offering a systematic, tradecraft-centered approach to evaluating AI-enabled decision support systems.<n>We introduce an iterative design framework called textitPrinciples-based Approach for Designing Trustworthy, Human-centered AI.<n>We demonstrate this framework in our development of the Reporting Assistant for Defense and Intelligence Tasks (READIT)
arXiv Detail & Related papers (2024-01-24T23:15:44Z)
Watch-And-Help: A Challenge for Social Perception and Human-AI Collaboration [116.28433607265573]
We introduce Watch-And-Help (WAH), a challenge for testing social intelligence in AI agents. In WAH, an AI agent needs to help a human-like agent perform a complex household task efficiently. We build VirtualHome-Social, a multi-agent household environment, and provide a benchmark including both planning and learning based baselines.
arXiv Detail & Related papers (2020-10-19T21:48:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.