Related papers: LogiDebrief: A Signal-Temporal Logic based Automated Debriefing Approach with Large Language Models Integration

URL: http://arxiv.org/abs/2505.03985v1
Date: Tue, 06 May 2025 21:27:07 GMT
Title: LogiDebrief: A Signal-Temporal Logic based Automated Debriefing Approach with Large Language Models Integration
Authors: Zirong Chen, Ziyan An, Jennifer Reynolds, Kristin Mullen, Stephen Martini, Meiyi Ma
Abstract summary: We introduce LogiDebrief, an AI-driven framework that automates human-led evaluations of 9-1-1 call-takers. LogiDebrief formalizes call-taking requirements as logical specifications, enabling systematic assessment of 9-1-1 calls. It has assisted in debriefing 1,701 real-world calls, saving 311.85 hours of active engagement.
Score: 2.1074375725054697
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Emergency response services are critical to public safety, with 9-1-1 call-takers playing a key role in ensuring timely and effective emergency operations. To ensure call-taking performance consistency, quality assurance is implemented to evaluate and refine call-takers' skillsets. However, traditional human-led evaluations struggle with high call volumes, leading to low coverage and delayed assessments. We introduce LogiDebrief, an AI-driven framework that automates traditional 9-1-1 call debriefing by integrating Signal-Temporal Logic (STL) with Large Language Models (LLMs) for fully-covered rigorous performance evaluation. LogiDebrief formalizes call-taking requirements as logical specifications, enabling systematic assessment of 9-1-1 calls against procedural guidelines. It employs a three-step verification process: (1) contextual understanding to identify responder types, incident classifications, and critical conditions; (2) STL-based runtime checking with LLM integration to ensure compliance; and (3) automated aggregation of results into quality assurance reports. Beyond its technical contributions, LogiDebrief has demonstrated real-world impact. Successfully deployed at Metro Nashville Department of Emergency Communications, it has assisted in debriefing 1,701 real-world calls, saving 311.85 hours of active engagement. Empirical evaluation with real-world data confirms its accuracy, while a case study and extensive user study highlight its effectiveness in enhancing call-taking performance.
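The three-step process described in the abstract (context identification, STL-based runtime checking, aggregation into a report) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Event` type, the event labels, the time windows, and the check names are all hypothetical, the labels are assumed to come from an upstream tagging step such as an LLM, and only Boolean satisfaction is computed rather than STL's quantitative robustness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t: float      # seconds since the call was answered (assumed time base)
    label: str    # hypothetical label, e.g. produced by an LLM tagging step


def eventually_within(events, label, start, end):
    """STL-style F_[start,end] label: true if `label` occurs at some t in [start, end]."""
    return any(e.label == label and start <= e.t <= end for e in events)


def implies_eventually(events, trigger, response, window):
    """STL-style G(trigger -> F_[0,window] response), evaluated over a finite trace."""
    return all(
        eventually_within(events, response, e.t, e.t + window)
        for e in events
        if e.label == trigger
    )


def debrief(events, checks):
    """Run every named check and aggregate the results into a simple report dict."""
    results = {name: check(events) for name, check in checks.items()}
    results["passed_all"] = all(results.values())
    return results


# Hypothetical call trace; the labels and timings are invented for illustration.
call = [
    Event(2.0, "ask_address"),
    Event(9.0, "ask_phone_number"),
    Event(20.0, "caller_reports_not_breathing"),
    Event(31.0, "give_cpr_instructions"),
]

# Hypothetical requirements; real guidelines and thresholds would differ.
checks = {
    "address_within_10s": lambda ev: eventually_within(ev, "ask_address", 0, 10),
    "cpr_within_15s_of_report": lambda ev: implies_eventually(
        ev, "caller_reports_not_breathing", "give_cpr_instructions", 15
    ),
    "name_within_60s": lambda ev: eventually_within(ev, "ask_caller_name", 0, 60),
}

report = debrief(call, checks)
print(report)
```

In this sketch each requirement is a small predicate over a timestamped trace, so adding a guideline means adding one entry to `checks`. The aggregated dictionary stands in for the quality assurance report mentioned in the abstract.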

Related papers

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
Large Language Models Assisting Ontology Evaluation [1.099532646524593]
Ontology evaluation through functional requirements is costly, labour-intensive, and error-prone. We introduce OE-Assist, a novel framework designed to assist ontology evaluation through automated and semi-automated verification.
arXiv Detail & Related papers (2025-07-19T09:13:51Z)
Foundation Models for Logistics: Toward Certifiable, Conversational Planning Interfaces [59.80143393787701]
Large language models (LLMs) can handle uncertainty and promise to accelerate replanning while lowering the barrier to entry. We introduce a neurosymbolic framework that pairs the accessibility of natural-language dialogue with verifiable guarantees on goal interpretation. A lightweight model, fine-tuned on just 100 uncertainty-filtered examples, surpasses the zero-shot performance of GPT-4.1 while cutting inference latency by nearly 50%.
arXiv Detail & Related papers (2025-07-15T14:24:01Z)
Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task. Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments. Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z)
Towards Automated Situation Awareness: A RAG-Based Framework for Peacebuilding Reports [2.230742111425553]
This paper introduces a dynamic Retrieval-Augmented Generation (RAG) system that autonomously generates situation awareness reports. Our system constructs query-specific knowledge bases on demand, ensuring timely, relevant, and accurate insights. The system is tested across multiple real-world scenarios, demonstrating its effectiveness in producing coherent, insightful, and actionable reports.
arXiv Detail & Related papers (2025-05-14T16:36:30Z)
Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation [52.83870601473094]
Embodied agents exhibit immense potential across a multitude of domains. Existing research predominantly concentrates on the security of general large language models. This paper introduces a novel input moderation framework, meticulously designed to safeguard embodied agents.
arXiv Detail & Related papers (2025-04-22T08:34:35Z)
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models [53.12387628636912]
We propose an automatic evaluation framework that is validated against human annotations. This approach was originally developed for the TREC Question Answering (QA) Track in 2003. We observe strong agreement at the run level between scores derived from fully automatic nugget evaluation and human-based variants.
arXiv Detail & Related papers (2025-04-21T12:55:06Z)
AgentOrca: A Dual-System Framework to Evaluate Language Agents on Operational Routine and Constraint Adherence [54.317522790545304]
We present AgentOrca, a dual-system framework for evaluating language agents' compliance with operational constraints and routines. Our framework encodes action constraints and routines through both natural language prompts for agents and corresponding executable code serving as ground truth for automated verification. Our findings reveal notable performance gaps among state-of-the-art models, with large reasoning models like o1 demonstrating superior compliance while others show significantly lower performance.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Real-Time Multimodal Cognitive Assistant for Emergency Medical Services [4.669165383466683]
This paper presents CognitiveEMS, an end-to-end wearable cognitive assistant system. It can act as a collaborative virtual partner engaging in the real-time acquisition and analysis of multimodal data from an emergency scene.
arXiv Detail & Related papers (2024-03-11T13:56:57Z)
Attribute Structuring Improves LLM-Based Evaluation of Clinical Text Summaries [56.31117605097345]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation. Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process. AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z)
Auto311: A Confidence-guided Automated System for Non-emergency Calls [2.025468874117372]
We analyzed 11,796 non-emergency call recordings and developed Auto311, the first automated system to handle 311 non-emergency calls. We used real-world data to evaluate the system's effectiveness and deployability.
arXiv Detail & Related papers (2023-12-19T20:52:04Z)
An Emergency Medical Services Clinical Audit System driven by Named Entity Recognition from Deep Learning [0.0]
We present an automatic audit system based on both the structured and unstructured ambulance case records and clinical notes with a deep neural network-based named entities recognition model. Our approach yielded a named entity recognition model that could reliably identify clinical entities from unstructured paramedic free-text reports.
arXiv Detail & Related papers (2020-07-07T16:32:44Z)

This list is automatically generated from the titles and abstracts of the papers on this site.