DisasterQA: A Benchmark for Assessing the performance of LLMs in Disaster Response
- URL: http://arxiv.org/abs/2410.20707v1
- Date: Wed, 09 Oct 2024 00:13:06 GMT
- Title: DisasterQA: A Benchmark for Assessing the performance of LLMs in Disaster Response
- Authors: Rajat Rawat,
- Abstract summary: We evaluate the capabilities of Large Language Models (LLMs) in disaster response knowledge.
The benchmark covers a wide range of disaster response topics.
The results indicate that LLMs require improvement on disaster response knowledge.
- Score: 0.0
- License:
- Abstract: Disasters can result in the deaths of many, making quick response times vital. Large Language Models (LLMs) have emerged as valuable in the field. LLMs can be used to process vast amounts of textual information quickly providing situational context during a disaster. However, the question remains whether LLMs should be used for advice and decision making in a disaster. To evaluate the capabilities of LLMs in disaster response knowledge, we introduce a benchmark: DisasterQA created from six online sources. The benchmark covers a wide range of disaster response topics. We evaluated five LLMs each with four different prompting methods on our benchmark, measuring both accuracy and confidence levels through Logprobs. The results indicate that LLMs require improvement on disaster response knowledge. We hope that this benchmark pushes forth further development of LLMs in disaster response, ultimately enabling these models to work alongside. emergency managers in disasters.
Related papers
- Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z) - CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs)
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z) - Monitoring Critical Infrastructure Facilities During Disasters Using Large Language Models [8.17728833322492]
Critical Infrastructure Facilities (CIFs) are vital for the functioning of a community, especially during large-scale emergencies.
In this paper, we explore a potential application of Large Language Models (LLMs) to monitor the status of CIFs affected by natural disasters through information disseminated in social media networks.
We analyze social media data from two disaster events in two different countries to identify reported impacts to CIFs as well as their impact severity and operational status.
arXiv Detail & Related papers (2024-04-18T19:41:05Z) - Evaluation and Improvement of Fault Detection for Large Language Models [30.760472387136954]
This paper investigates the effectiveness of existing fault detection methods for large language models (LLMs)
We propose textbfMuCS, a prompt textbfMutation-based prediction textbfConfidence textbfSmoothing framework to boost the fault detection capability of existing methods.
arXiv Detail & Related papers (2024-04-14T07:06:12Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - Fact-and-Reflection (FaR) Improves Confidence Calibration of Large Language Models [84.94220787791389]
We propose Fact-and-Reflection (FaR) prompting, which improves the LLM calibration in two steps.
Experiments show that FaR achieves significantly better calibration; it lowers the Expected Error by 23.5%.
FaR even elicits the capability of verbally expressing concerns in less confident scenarios.
arXiv Detail & Related papers (2024-02-27T01:37:23Z) - Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method [36.24876571343749]
Large Language Models (LLMs) have shown great potential in Natural Language Processing (NLP) tasks.
Recent literature reveals that LLMs generate nonfactual responses intermittently.
We propose a novel self-detection method to detect which questions that a LLM does not know that are prone to generate nonfactual results.
arXiv Detail & Related papers (2023-10-27T06:22:14Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z) - Exploring the Responses of Large Language Models to Beginner
Programmers' Help Requests [1.8260333137469122]
We assess how good large language models (LLMs) are at identifying issues in problematic code that students request help on.
We collected a sample of help requests and code from an online programming course.
arXiv Detail & Related papers (2023-06-09T07:19:43Z) - On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation.
Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.