Improving the State of the Art for Training Human-AI Teams: Technical
Report #3 -- Analysis of Testbed Alternatives
- URL: http://arxiv.org/abs/2309.03213v1
- Date: Tue, 29 Aug 2023 14:06:30 GMT
- Title: Improving the State of the Art for Training Human-AI Teams: Technical
Report #3 -- Analysis of Testbed Alternatives
- Authors: Lillian Asiala, James E. McCarthy, Lixiao Huang
- Abstract summary: Sonalysts is working on an initiative to expand its expertise in teaming to Human-Artificial Intelligence (AI) teams.
To provide a foundation for that research, Sonalysts is investigating the development of a Synthetic Task Environment.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sonalysts is working on an initiative to expand our current expertise in
teaming to Human-Artificial Intelligence (AI) teams by developing original
research in this area. To provide a foundation for that research, Sonalysts is
investigating the development of a Synthetic Task Environment (STE). In a
previous report, we documented the findings of a recent outreach effort in
which we asked military Subject Matter Experts (SMEs) and other researchers in
the Human-AI teaming domain to identify the qualities that they most valued in
a testbed. A surprising finding from that outreach was that several respondents
recommended that our team look into existing human-AI teaming testbeds, rather
than creating something new. Based on that recommendation, we conducted a
systematic investigation of the associated landscape. In this report, we
describe the results of that investigation. Building on the survey results, we
developed testbed evaluation criteria, identified potential testbeds, and
conducted qualitative and quantitative evaluations of candidate testbeds. The
evaluation process led to five candidate testbeds for the research team to
consider. In the coming months, we will assess the viability of the various
alternatives and begin to execute our program of research.
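The quantitative evaluation step described above can be pictured as a weighted scoring matrix over candidate testbeds. The sketch below is a minimal illustration of that idea only; the criteria names, weights, candidates, and ratings are hypothetical placeholders and do not come from the report.

```python
# Minimal sketch of a weighted-criteria evaluation of candidate testbeds.
# All criteria, weights, candidates, and ratings are hypothetical placeholders,
# not values taken from the Sonalysts report.

CRITERIA_WEIGHTS = {
    "team_task_fidelity": 0.30,
    "ai_teammate_support": 0.25,
    "data_logging": 0.20,
    "extensibility": 0.15,
    "cost_and_licensing": 0.10,
}

# Qualitative ratings (1-5) assigned to each candidate on each criterion.
candidate_scores = {
    "Testbed A": {"team_task_fidelity": 4, "ai_teammate_support": 3,
                  "data_logging": 5, "extensibility": 4, "cost_and_licensing": 2},
    "Testbed B": {"team_task_fidelity": 3, "ai_teammate_support": 5,
                  "data_logging": 3, "extensibility": 3, "cost_and_licensing": 4},
}

def weighted_score(ratings: dict[str, int]) -> float:
    """Combine per-criterion ratings into a single weighted score."""
    return sum(CRITERIA_WEIGHTS[criterion] * rating
               for criterion, rating in ratings.items())

# Rank candidates from highest to lowest weighted score.
ranking = sorted(candidate_scores.items(),
                 key=lambda item: weighted_score(item[1]),
                 reverse=True)

for name, ratings in ranking:
    print(f"{name}: {weighted_score(ratings):.2f}")
```

In an actual application of this approach, the report's own criteria and weights would replace the placeholders, and the resulting quantitative ranking would be combined with the qualitative review described in the abstract to arrive at the short list of candidate testbeds.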
Related papers
- The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research [56.80927148740585]
We address the challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.
arXiv Detail & Related papers (2026-02-05T19:00:02Z) - From Task Executors to Research Partners: Evaluating AI Co-Pilots Through Workflow Integration in Biomedical Research [0.16174969956296248]
This rapid review examines benchmarking practices for AI systems in preclinical biomedical research. A process-oriented evaluation framework is proposed that addresses four critical dimensions absent from current benchmarks. These dimensions are essential for evaluating AI systems as research co-pilots rather than as isolated task executors.
arXiv Detail & Related papers (2025-12-04T14:37:46Z) - AstaBench: Rigorous Benchmarking of AI Agents with a Scientific Research Suite [75.58737079136942]
We present AstaBench, a suite that provides the first holistic measure of agentic ability to perform scientific research. Our suite comes with the first scientific research environment with production-grade search tools. Our evaluation of 57 agents across 22 agent classes reveals several interesting findings.
arXiv Detail & Related papers (2025-10-24T17:10:26Z) - Towards Personalized Deep Research: Benchmarks and Evaluations [56.581105664044436]
We introduce Personalized Deep Research Bench, the first benchmark for evaluating personalization in Deep Research Agents (DRAs). It pairs 50 diverse research tasks with 25 authentic user profiles that combine structured persona attributes with dynamic real-world contexts, yielding 250 realistic user-task queries. Our experiments on a range of systems highlight current capabilities and limitations in handling personalized deep research.
arXiv Detail & Related papers (2025-09-29T17:39:17Z) - ResearcherBench: Evaluating Deep AI Research Systems on the Frontiers of Scientific Inquiry [22.615102398311432]
We introduce ResearcherBench, the first benchmark focused on evaluating the capabilities of deep AI research systems. We compiled a dataset of 65 research questions expertly selected from real-world scientific scenarios. OpenAI Deep Research and Gemini Deep Research significantly outperform other systems, with particular strength in open-ended consulting questions.
arXiv Detail & Related papers (2025-07-22T06:51:26Z) - AI4Research: A Survey of Artificial Intelligence for Scientific Research [55.5452803680643]
We present a comprehensive survey on AI for Research (AI4Research). We first introduce a systematic taxonomy to classify five mainstream tasks in AI4Research. We identify key research gaps and highlight promising future directions.
arXiv Detail & Related papers (2025-07-02T17:19:20Z) - SciArena: An Open Evaluation Platform for Foundation Models in Scientific Literature Tasks [87.29946641069068]
We present SciArena, an open and collaborative platform for evaluating foundation models on scientific literature tasks. By leveraging collective intelligence, SciArena offers a community-driven evaluation of model performance on open-ended scientific tasks. We release SciArena-Eval, a meta-evaluation benchmark based on our collected preference data.
arXiv Detail & Related papers (2025-07-01T17:51:59Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z) - On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities.
We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks.
arXiv Detail & Related papers (2025-02-27T20:21:36Z) - Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation [58.064940977804596]
A plethora of new AI models and tools has been proposed, promising to empower researchers and academics worldwide to conduct their research more effectively and efficiently.
Ethical concerns regarding shortcomings of these tools and potential for misuse take a particularly prominent place in our discussion.
arXiv Detail & Related papers (2025-02-07T18:26:45Z) - A Decade of Action Quality Assessment: Largest Systematic Survey of Trends, Challenges, and Future Directions [8.27542607031299]
Action Quality Assessment (AQA) has far-reaching implications in areas such as low-cost physiotherapy, sports training, and workforce development.
We systematically review over 200 research papers using the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) framework.
This survey provides a detailed analysis of research trends, performance comparisons, challenges, and future directions.
arXiv Detail & Related papers (2025-02-05T01:33:24Z) - On Evaluating Explanation Utility for Human-AI Decision Making in NLP [39.58317527488534]
We review existing metrics suitable for application-grounded evaluation.
We demonstrate the importance of reassessing the state of the art to form and study human-AI teams.
arXiv Detail & Related papers (2024-07-03T23:53:27Z) - ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models [56.08917291606421]
ResearchAgent is an AI-based system for ideation and operationalization of novel work.
ResearchAgent automatically defines novel problems, proposes methods and designs experiments, while iteratively refining them.
We experimentally validate our ResearchAgent on scientific publications across multiple disciplines.
arXiv Detail & Related papers (2024-04-11T13:36:29Z) - SurveyAgent: A Conversational System for Personalized and Efficient Research Survey [50.04283471107001]
This paper introduces SurveyAgent, a novel conversational system designed to provide personalized and efficient research survey assistance to researchers.
SurveyAgent integrates three key modules: Knowledge Management for organizing papers, Recommendation for discovering relevant literature, and Query Answering for engaging with content on a deeper level.
Our evaluation demonstrates SurveyAgent's effectiveness in streamlining research activities, showcasing its capability to facilitate how researchers interact with scientific literature.
arXiv Detail & Related papers (2024-04-09T15:01:51Z) - Search-Based Fairness Testing: An Overview [4.453735522794044]
Biases in AI systems raise ethical and societal concerns.
This paper reviews current research on fairness testing, particularly its application through search-based testing.
arXiv Detail & Related papers (2023-11-10T16:47:56Z) - Improving the State of the Art for Training Human-AI Teams: Technical
Report #2 -- Results of Researcher Knowledge Elicitation Survey [0.0]
Sonalysts has begun an internal initiative to explore the training of Human-AI teams.
The first step in this effort is to develop a Synthetic Task Environment (STE) that is capable of facilitating research on Human-AI teams.
arXiv Detail & Related papers (2023-08-29T13:54:32Z) - Improving the State of the Art for Training Human-AI Teams: Technical
Report #1 -- Results of Subject-Matter Expert Knowledge Elicitation Survey [0.0]
Sonalysts has begun an internal initiative to explore the training of human-AI teams.
We decided to use Joint All-Domain Command and Control (JADC2) as a focus point.
We engaged a number of Subject-Matter Experts (SMEs) with Command and Control experience to gain insight into developing an STE that embodied the teaming challenges associated with JADC2.
arXiv Detail & Related papers (2023-08-29T13:42:52Z) - ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate [57.71597869337909]
We build a multi-agent referee team called ChatEval to autonomously discuss and evaluate the quality of generated responses from different models.
Our analysis shows that ChatEval transcends mere textual scoring, offering a human-mimicking evaluation process for reliable assessments.
arXiv Detail & Related papers (2023-08-14T15:13:04Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Survey of Aspect-based Sentiment Analysis Datasets [55.61047894397937]
Aspect-based sentiment analysis (ABSA) is a natural language processing problem that requires analyzing user-generated reviews.
Numerous yet scattered corpora for ABSA make it difficult for researchers to identify corpora best suited for a specific ABSA subtask quickly.
This study aims to present a database of corpora that can be used to train and assess autonomous ABSA systems.
arXiv Detail & Related papers (2022-04-11T16:23:36Z) - An Uncommon Task: Participatory Design in Legal AI [64.54460979588075]
We examine a notable yet understudied AI design process in the legal domain that took place over a decade ago.
We show how an interactive simulation methodology allowed computer scientists and lawyers to become co-designers.
arXiv Detail & Related papers (2022-03-08T15:46:52Z) - Scaling up Search Engine Audits: Practical Insights for Algorithm
Auditing [68.8204255655161]
We set up experiments for eight search engines with hundreds of virtual agents placed in different regions.
We demonstrate the successful performance of our research infrastructure across multiple data collections.
We conclude that virtual agents are a promising avenue for monitoring the performance of algorithms over long periods of time.
arXiv Detail & Related papers (2021-06-10T15:49:58Z) - Human-AI Symbiosis: A Survey of Current Approaches [18.252264744963394]
We highlight various aspects of works on the human-AI team such as the flow of complementing, task horizon, model representation, knowledge level, and teaming goal.
We hope that this survey will provide a clearer connection between works on human-AI teaming and guidance for new researchers in this area.
arXiv Detail & Related papers (2021-03-18T02:39:28Z) - Robustness Gym: Unifying the NLP Evaluation Landscape [91.80175115162218]
Deep neural networks are often brittle when deployed in real-world systems.
Recent research has focused on testing the robustness of such models.
We propose a solution in the form of Robustness Gym, a simple and extensible evaluation toolkit.
arXiv Detail & Related papers (2021-01-13T02:37:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.