OrgAccess: A Benchmark for Role Based Access Control in Organization Scale LLMs
- URL: http://arxiv.org/abs/2505.19165v3
- Date: Tue, 17 Jun 2025 16:48:29 GMT
- Title: OrgAccess: A Benchmark for Role Based Access Control in Organization Scale LLMs
- Authors: Debdeep Sanyal, Umakanta Maharana, Yash Sinha, Hong Ming Tan, Shirish Karande, Mohan Kankanhalli, Murari Mandal
- Abstract summary: Large Language Models (LLMs) serve as unified knowledge repositories and intelligent assistants in enterprise settings. Evaluating whether they can operate within organizational hierarchies and permissions is inherently difficult due to the proprietary and sensitive nature of real-world corporate data and access control policies. We introduce a synthetic yet representative OrgAccess benchmark consisting of 40 distinct types of permissions commonly relevant across different organizational roles and levels.
- Score: 7.999158988904784
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical, yet underexplored, challenge emerges: \textit{can these models reliably understand and operate within the complex, often nuanced, constraints imposed by organizational hierarchies and associated permissions?} Evaluating this crucial capability is inherently difficult due to the proprietary and sensitive nature of real-world corporate data and access control policies. We introduce a synthetic yet representative \textbf{OrgAccess} benchmark consisting of 40 distinct types of permissions commonly relevant across different organizational roles and levels. We further create three difficulty tiers of test cases: 40,000 easy (single permission), 10,000 medium (3-permission tuples), and 20,000 hard (5-permission tuples), to test LLMs' ability to accurately assess these permissions and generate responses that strictly adhere to the specified hierarchical rules, particularly in scenarios involving users with overlapping or conflicting permissions. Our findings reveal that even state-of-the-art LLMs struggle significantly to maintain compliance with role-based structures, even with explicit instructions, and their performance degrades further when navigating interactions involving two or more conflicting permissions. Specifically, even \textbf{GPT-4.1 only achieves an F1-Score of 0.27 on our hardest benchmark}. This demonstrates a critical limitation in LLMs' complex rule-following and compositional reasoning capabilities beyond standard factual or STEM-based benchmarks, opening up a new paradigm for evaluating their fitness for practical, structured environments.
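To make the setup concrete, below is a minimal, hypothetical Python sketch of what an OrgAccess-style evaluation item and the reported allow/deny F1 scoring might look like. The field names (role, permissions, request, expected_allow) and the binary allow/deny framing are illustrative assumptions based on the abstract, not the released benchmark schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AccessCase:
    """One evaluation item: a role, the permissions in play, a request, and the ground-truth decision."""
    role: str                # requester's organizational role
    permissions: List[str]   # 1 (easy), 3 (medium), or 5 (hard) permissions in play
    request: str             # natural-language query posed to the LLM
    expected_allow: bool     # ground-truth decision under the stated hierarchy

def f1(cases: List[AccessCase], predictions: List[bool]) -> float:
    """Binary F1 over allow/deny decisions, treating 'allow' as the positive class."""
    tp = sum(1 for c, p in zip(cases, predictions) if p and c.expected_allow)
    fp = sum(1 for c, p in zip(cases, predictions) if p and not c.expected_allow)
    fn = sum(1 for c, p in zip(cases, predictions) if not p and c.expected_allow)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy examples: one single-permission (easy) case and one 5-permission (hard) case
# where a routine-sounding request collides with explicit denies.
cases = [
    AccessCase(role="hr_manager",
               permissions=["read:employee_records"],
               request="Show the leave balance for my direct reports.",
               expected_allow=True),
    AccessCase(role="regional_sales_manager",
               permissions=["read:customer_pii", "deny:export_pii",
                            "read:quarterly_forecast", "write:team_calendar",
                            "deny:view_exec_compensation"],
               request="Export the customer contact list along with exec salaries.",
               expected_allow=False),
]
print(f1(cases, [True, False]))  # 1.0: both decisions match the ground truth
```

In this sketch the hard tier corresponds to cases like the second one, where the model must compose several overlapping grants and denies before answering; the paper's headline result is that GPT-4.1 reaches only 0.27 F1 on that tier.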
Related papers
- Role-Aware Language Models for Secure and Contextualized Access Control in Organizations [4.122315998598296]
Large language models (LLMs) are increasingly deployed in enterprise settings. We investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles.
arXiv Detail & Related papers (2025-07-31T11:41:04Z) - Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness [2.5967788365637103]
Large language models (LLMs) are increasingly valuable to corporate data management due to their ability to process text from various document formats. This work establishes a foundation for sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.
arXiv Detail & Related papers (2025-06-01T11:24:23Z) - Permissioned LLMs: Enforcing Access Control in Large Language Models [14.935672762016972]
Permissioned LLMs (PermLLM) superimpose organizational data access control structures on query responses. PermLLM mechanisms build on Efficient Fine-Tuning to achieve the desired access control. We demonstrate the efficacy of our PermLLM mechanisms through extensive experiments on four public datasets.
arXiv Detail & Related papers (2025-05-28T20:47:02Z) - HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning [25.088407009353162]
Existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures. HiBench is the first framework spanning from initial structure generation to final proficiency assessment. It consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries.
arXiv Detail & Related papers (2025-03-02T14:25:37Z) - RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios [58.90106984375913]
RuleArena is a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains -- airline baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs' proficiency in handling intricate natural language instructions.
arXiv Detail & Related papers (2024-12-12T06:08:46Z) - Benchmarking Complex Instruction-Following with Multiple Constraints Composition [72.82640456309821]
How to evaluate the complex-instruction-following ability of large language models (LLMs) has become a critical research problem.
Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints.
We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
arXiv Detail & Related papers (2024-07-04T14:50:45Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models [46.07900122810749]
Large language models (LLMs) have achieved unprecedented performance in various applications, yet evaluating them is still challenging.
We contend that utilizing existing relational databases is a promising approach for constructing benchmarks.
We propose ERBench, which uses these integrity constraints to convert any database into an LLM benchmark.
arXiv Detail & Related papers (2024-03-08T12:42:36Z) - DePLOI: Applying NL2SQL to Synthesize and Audit Database Access Control [6.2859996652179]
This paper introduces a new access control model called Intent-Based Access Control for Databases (IBAC-DB). In IBAC-DB, access control policies are expressed using abstractions that scale to high numbers of database objects, and are traceable with respect to implementations. This paper proposes DePLOI, a system leveraging access control-specific task decompositions to accurately synthesize and audit access control implementation from IBAC-DB abstractions.
arXiv Detail & Related papers (2024-02-11T23:50:12Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraints-following benchmark for large language models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)