Assessing Generative AI value in a public sector context: evidence from a field experiment
- URL: http://arxiv.org/abs/2502.09479v1
- Date: Thu, 13 Feb 2025 16:43:32 GMT
- Title: Assessing Generative AI value in a public sector context: evidence from a field experiment
- Authors: Trevor Fitzpatrick, Seamus Kelly, Patrick Carey, David Walsh, Ruairi Nugent,
- Abstract summary: We find mixed evidence for two types of composite tasks related to document understanding and data analysis. For the Documents task, the treatment group using Gen AI had a 17% improvement in answer quality scores and a 34% improvement in task completion time compared to a control group. For the Data task, we find the Gen AI treatment group experienced a 12% reduction in quality scores and no significant difference in mean completion time compared to the control group.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of Generative AI (Gen AI) has motivated an interest in understanding how it could be used to enhance productivity across various tasks. We add to research results for the performance impact of Gen AI on complex knowledge-based tasks in a public sector setting. In a pre-registered experiment, after establishing a baseline level of performance, we find mixed evidence for two types of composite tasks related to document understanding and data analysis. For the Documents task, the treatment group using Gen AI had a 17% improvement in answer quality scores (as judged by human evaluators) and a 34% improvement in task completion time compared to a control group. For the Data task, we find the Gen AI treatment group experienced a 12% reduction in quality scores and no significant difference in mean completion time compared to the control group. These results suggest that the benefits of Gen AI may be task and potentially respondent dependent. We also discuss field notes and lessons learned, as well as supplementary insights from a post-trial survey and feedback workshop with participants.
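As an illustration of the arithmetic behind these comparisons, the sketch below computes percent differences in group means for quality scores and completion times and runs a Welch t-test on synthetic data; the sample sizes, score scale, time units, and choice of test are assumptions for illustration, not the paper's pre-registered analysis.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative data only: the sample sizes, score scale (0-100),
# and time units (minutes) are assumptions, not the experiment's actual data.
rng = np.random.default_rng(42)
quality_control   = rng.normal(60, 10, size=50)   # answer quality, control group
quality_treatment = rng.normal(70, 10, size=50)   # answer quality, Gen AI group
time_control      = rng.normal(45, 8, size=50)    # completion time in minutes
time_treatment    = rng.normal(30, 8, size=50)

def percent_change(treated: np.ndarray, control: np.ndarray) -> float:
    """Relative difference in group means, in percent."""
    return (treated.mean() - control.mean()) / control.mean() * 100

for label, treated, control in [
    ("quality", quality_treatment, quality_control),
    ("time",    time_treatment,    time_control),
]:
    test = stats.ttest_ind(treated, control, equal_var=False)  # Welch t-test
    print(f"{label:>7}: {percent_change(treated, control):+.1f}% "
          f"vs control (p = {test.pvalue:.4f})")
```

On real data, the quality scores would come from the human evaluators' ratings and the times from task completion logs.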
Related papers
- Short-Term Gains, Long-Term Gaps: The Impact of GenAI and Search Technologies on Retention [1.534667887016089]
This study investigates how GenAI (ChatGPT), search engines (Google), and e-textbooks influence student performance across tasks of varying cognitive complexity. ChatGPT and Google groups outperformed the control group in immediate assessments for lower-order cognitive tasks. While AI-driven tools facilitate immediate performance, they do not inherently reinforce long-term retention unless supported by structured learning strategies.
arXiv Detail & Related papers (2025-07-10T00:44:50Z)
- Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.69724201080155]
We show that many agentic benchmarks have issues in task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience.
arXiv Detail & Related papers (2025-07-03T17:35:31Z) - The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority. We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting area chairs (ACs) in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- Evaluating the AI-Lab Intervention: Impact on Student Perception and Use of Generative AI in Early Undergraduate Computer Science Courses [0.0]
Generative AI (GenAI) is rapidly entering computer science education.
Concerns about overreliance coexist with a gap in research on structured scaffolding to guide tool use in formal courses.
This study examines the impact of a dedicated "AI-Lab" intervention on undergraduate students.
arXiv Detail & Related papers (2025-04-30T18:12:42Z)
- From G-Factor to A-Factor: Establishing a Psychometric Framework for AI Literacy [1.5031024722977635]
We establish AI literacy as a coherent, measurable construct with significant implications for education, workforce development, and social equity.
Study 1 revealed a dominant latent factor - termed the "A-factor" - that accounts for 44.16% of variance across diverse AI interaction tasks.
Study 2 refined the measurement tool by examining four key dimensions of AI literacy.
Regression analyses identified several significant predictors of AI literacy, including cognitive abilities (IQ), educational background, prior AI experience, and training history.
arXiv Detail & Related papers (2025-03-16T14:51:48Z)
- General Scales Unlock AI Evaluation with Explanatory and Predictive Power [57.7995945974989]
Benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems.
We introduce general scales for AI evaluation that can explain what common AI benchmarks really measure.
Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate.
arXiv Detail & Related papers (2025-03-09T01:13:56Z)
- Towards Decoding Developer Cognition in the Age of AI Assistants [9.887133861477233]
We propose a controlled observational study combining physiological measurements (EEG and eye tracking) with interaction data to examine developers' use of AI-assisted programming tools. We will recruit professional developers to complete programming tasks both with and without AI assistance while measuring their cognitive load and task completion time.
arXiv Detail & Related papers (2025-01-05T23:25:21Z)
- CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark [11.794931453828974]
CORE-Bench is a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine).
We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way.
The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks.
arXiv Detail & Related papers (2024-09-17T17:13:19Z)
- SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
SUPER aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories.
Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 subproblems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development.
We show that state-of-the-art approaches struggle to solve these problems: the best model (GPT-4o) solves only 16.3% of the end-to-end set and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
- Measuring Human Contribution in AI-Assisted Content Generation [66.06040950325969]
This study raises the research question of measuring human contribution in AI-assisted content generation. By calculating the mutual information between human input and AI-assisted output relative to the self-information of the AI-assisted output, we quantify the proportional information contribution of humans in content generation.
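A minimal sketch of the kind of ratio described above, assuming discrete distributions and a toy joint probability table; the variables and numbers below are purely illustrative, not the paper's estimator.

```python
import numpy as np

# Toy joint distribution p(x, y) over discretized human-input states (rows)
# and AI-assisted output states (columns); illustrative numbers only.
p_xy = np.array([
    [0.20, 0.05, 0.05],
    [0.05, 0.25, 0.05],
    [0.05, 0.05, 0.25],
])

p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of human input X
p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of AI-assisted output Y

# Mutual information I(X; Y) in bits.
mask = p_xy > 0
mi = float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Self-information (entropy) of the output, H(Y), in bits.
h_y = float(-np.sum(p_y[p_y > 0] * np.log2(p_y[p_y > 0])))

# Proportional human contribution: share of the output's information
# explained by the human input.
print(f"I(X;Y) = {mi:.3f} bits, H(Y) = {h_y:.3f} bits, share = {mi / h_y:.1%}")
```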
arXiv Detail & Related papers (2024-08-27T05:56:04Z)
- GAIA: Rethinking Action Quality Assessment for AI-Generated Videos [56.047773400426486]
Action quality assessment (AQA) algorithms predominantly focus on actions from specific real-world scenarios and are pre-trained with normative action features.
We construct GAIA, a Generic AI-generated Action dataset, by conducting a large-scale subjective evaluation from a novel causal reasoning-based perspective.
Results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods perform poorly, with average Spearman rank correlation coefficients (SRCC) of 0.454, 0.191, and 0.519, respectively.
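For reference, a minimal sketch of computing an SRCC between hypothetical human ratings and model-predicted scores (the values below are made up for illustration):

```python
from scipy.stats import spearmanr

# Hypothetical scores for five videos: human subjective ratings vs. a model's
# predicted action-quality scores (values are made up for illustration).
human_scores = [3.1, 4.5, 2.0, 4.9, 3.8]
model_scores = [0.42, 0.61, 0.35, 0.58, 0.70]

srcc, p_value = spearmanr(human_scores, model_scores)
print(f"SRCC = {srcc:.3f} (p = {p_value:.3f})")
```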
arXiv Detail & Related papers (2024-06-10T08:18:07Z)
- Experimental Evidence on Negative Impact of Generative AI on Scientific Learning Outcomes [0.0]
Using AI for summarization significantly improved both quality and output.
Individuals with a robust background in the reading topic and superior reading/writing skills benefitted the most.
arXiv Detail & Related papers (2023-09-23T21:59:40Z)
- ADVISE: AI-accelerated Design of Evidence Synthesis for Global Development [2.6293574825904624]
This study develops an AI agent based on a Bidirectional Encoder Representations from Transformers (BERT) model.
We explore the effectiveness of the human-AI hybrid team in accelerating the evidence synthesis process.
Results show that incorporating the BERT-based AI agent into the human team can reduce the human screening effort by 68.5%.
arXiv Detail & Related papers (2023-05-02T01:29:53Z)
- A Reinforcement Learning-assisted Genetic Programming Algorithm for Team Formation Problem Considering Person-Job Matching [70.28786574064694]
A reinforcement learning-assisted genetic programming algorithm (RL-GP) is proposed to enhance the quality of solutions.
The hyper-heuristic rules obtained through efficient learning can be utilized as decision-making aids when forming project teams.
arXiv Detail & Related papers (2023-04-08T14:32:12Z)
- Advancing Human-AI Complementarity: The Impact of User Expertise and Algorithmic Tuning on Joint Decision Making [10.890854857970488]
Many factors can impact the success of human-AI teams, including a user's domain expertise, mental models of an AI system, trust in recommendations, and more.
Our study examined user performance in a non-trivial blood vessel labeling task where participants indicated whether a given blood vessel was flowing or stalled.
Our results show that while recommendations from an AI-Assistant can aid user decision making, factors such as users' baseline performance relative to the AI and complementary tuning of AI error types significantly impact overall team performance.
arXiv Detail & Related papers (2022-08-16T21:39:58Z)
- An Exploration of Post-Editing Effectiveness in Text Summarization [58.99765574294715]
"Post-editing" AI-generated text reduces human workload and improves the quality of AI output.
We compare post-editing of provided summaries with manual summarization in terms of summary quality, human efficiency, and user experience.
This study offers valuable insights into when post-editing is useful for text summarization.
arXiv Detail & Related papers (2022-06-13T18:00:02Z)