The CitizenQuery Benchmark: A Novel Dataset and Evaluation Pipeline for Measuring LLM Performance in Citizen Query Tasks
- URL: http://arxiv.org/abs/2602.04064v1
- Date: Tue, 03 Feb 2026 22:58:09 GMT
- Title: The CitizenQuery Benchmark: A Novel Dataset and Evaluation Pipeline for Measuring LLM Performance in Citizen Query Tasks
- Authors: Neil Majithia, Rajat Shinde, Zo Chapman, Prajun Trital, Jordan Decker, Manil Maskey, Elena Simperl, Nigel Shadbolt
- Abstract summary: "Citizen queries" are questions asked by an individual about government policies, guidance, and services that are relevant to their circumstances. This represents a compelling use case for Large Language Models that respond to citizen queries with information adapted to a user's context. We introduce CitizenQuery-UK, a benchmark dataset of 22 thousand pairs of citizen queries and responses.
- Score: 8.50465147895087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: "Citizen queries" are questions asked by an individual about government policies, guidance, and services that are relevant to their circumstances, encompassing a range of topics including benefits, taxes, immigration, employment, public health, and more. This represents a compelling use case for Large Language Models (LLMs) that respond to citizen queries with information that is adapted to a user's context and communicated according to their needs. However, in this use case, any misinformation could have severe, negative, likely invisible ramifications for an individual placing their trust in a model's response. To this end, we introduce CitizenQuery-UK, a benchmark dataset of 22 thousand pairs of citizen queries and responses that have been synthetically generated from the swathes of public information on gov.uk about government in the UK. We present the curation methodology behind CitizenQuery-UK and an overview of its contents. We also introduce a methodology for the benchmarking of LLMs with the dataset, using an adaptation of FActScore to benchmark 11 models for factuality, abstention frequency, and verbosity. We document these results, and interpret them in the context of the public sector, finding that: (i) there are distinct performance profiles across model families, but each is competitive; (ii) high variance undermines utility; (iii) abstention is low and verbosity is high, with implications on reliability; and (iv) more trustworthy AI requires acknowledged "fallibility" in the way it interacts with users. The contribution of our research lies in assessing the trustworthiness of LLMs in citizen query tasks; as we see a world of increasing AI integration into day-to-day life, our benchmark, built entirely on open data, lays the foundations for better evidenced decision-making regarding AI and the public sector.
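The three headline metrics the abstract names (a FActScore-style factuality ratio over atomic facts, abstention frequency, and verbosity) can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline: the `Judged` record, the word-count definition of verbosity, and the abstention markers are all assumptions for the sake of the example.

```python
from dataclasses import dataclass

# Hypothetical phrases used to flag an abstaining response.
ABSTENTION_MARKERS = ("i don't know", "i cannot answer", "consult gov.uk")

@dataclass
class Judged:
    """One benchmark item after fact-level judging (illustrative schema)."""
    supported_facts: int  # atomic facts in the response judged as supported
    total_facts: int      # atomic facts extracted from the response
    response: str

def factuality(j: Judged) -> float:
    """FActScore-style score: fraction of atomic facts that are supported."""
    return j.supported_facts / j.total_facts if j.total_facts else 0.0

def is_abstention(response: str) -> bool:
    """Flag responses that decline to answer (marker-based heuristic)."""
    lowered = response.lower()
    return any(marker in lowered for marker in ABSTENTION_MARKERS)

def verbosity(response: str) -> int:
    """Verbosity proxied as whitespace-delimited word count."""
    return len(response.split())

def benchmark(judged: list[Judged]) -> dict[str, float]:
    """Aggregate the three per-item metrics into dataset-level averages."""
    n = len(judged)
    return {
        "factuality": sum(factuality(j) for j in judged) / n,
        "abstention_rate": sum(is_abstention(j.response) for j in judged) / n,
        "mean_verbosity": sum(verbosity(j.response) for j in judged) / n,
    }
```

A model that abstains scores zero on factuality here (no facts asserted), which is one way the factuality/abstention trade-off the paper discusses can surface in aggregate numbers.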
Related papers
- GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only a 19.30% exact match score.
arXiv Detail & Related papers (2026-02-09T11:44:15Z)
- ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness [2.5967788365637103]
Large language models (LLMs) are increasingly valuable to corporate data management due to their ability to process text from various document formats. This work establishes a foundation for sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.
arXiv Detail & Related papers (2025-06-01T11:24:23Z)
- Contextual Integrity in LLMs via Reasoning and Reinforcement Learning [41.795843170879046]
We develop a reinforcement learning framework that instills in models the reasoning necessary to achieve contextual integrity. We show that our method substantially reduces inappropriate information disclosure while maintaining task performance.
arXiv Detail & Related papers (2025-05-29T21:26:21Z)
- Evaluating Cultural and Social Awareness of LLM Web Agents [113.49968423990616]
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms. Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations. Experiments show that current LLMs perform significantly better in non-agent environments.
arXiv Detail & Related papers (2024-10-30T17:35:44Z)
- Building Understandable Messaging for Policy and Evidence Review (BUMPER) with AI [0.3495246564946556]
We introduce a framework for the use of large language models (LLMs) in Building Understandable Messaging for Policy and Evidence Review (BUMPER).
LLMs are capable of providing interfaces for understanding and synthesizing large databases of diverse media.
We argue that this framework can facilitate accessibility of and confidence in scientific evidence for policymakers.
arXiv Detail & Related papers (2024-06-27T05:03:03Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Leveraging Large Language Models for Topic Classification in the Domain of Public Affairs [65.9077733300329]
Large Language Models (LLMs) have the potential to greatly enhance the analysis of public affairs documents and other domain-specific texts.
arXiv Detail & Related papers (2023-06-05T13:35:01Z)
- Lessons Learned from a Citizen Science Project for Natural Language Processing [53.48988266271858]
Citizen Science is an alternative to crowdsourcing that is relatively unexplored in the context of NLP.
We conduct an exploratory study into engaging different groups of volunteers in Citizen Science for NLP by re-annotating parts of a pre-existing crowdsourced dataset.
Our results show that this can yield high-quality annotations and attract motivated volunteers, but also requires considering factors such as scalability, participation over time, and legal and ethical issues.
arXiv Detail & Related papers (2023-04-25T14:08:53Z)
- FacTeR-Check: Semi-automated fact-checking through Semantic Similarity and Natural Language Inference [61.068947982746224]
FacTeR-Check enables retrieval of fact-checked information, verification of unchecked claims, and tracking of dangerous information over social media.
The architecture is validated using a new dataset called NLI19-SP that is publicly released with COVID-19 related hoaxes and tweets from Spanish social media.
Our results show state-of-the-art performance on the individual benchmarks, as well as producing useful analysis of the evolution over time of 61 different hoaxes.
arXiv Detail & Related papers (2021-10-27T15:44:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.