CODE-ACCORD: A Corpus of building regulatory data for rule generation towards automatic compliance checking
- URL: http://arxiv.org/abs/2403.02231v4
- Date: Tue, 18 Feb 2025 11:00:19 GMT
- Title: CODE-ACCORD: A Corpus of building regulatory data for rule generation towards automatic compliance checking
- Authors: Hansi Hettiarachchi, Amna Dridi, Mohamed Medhat Gaber, Pouyan Parsafard, Nicoleta Bocaneala, Katja Breitenfelder, Gonçal Costa, Maria Hedblom, Mihaela Juganaru-Mathieu, Thamer Mecharnia, Sumee Park, He Tan, Abdel-Rahman H. Tawil, Edlira Vakaj
- Abstract summary: CODE-ACCORD is a dataset of 862 sentences from the building regulations of England and Finland. It supports a range of ML and Natural Language Processing (NLP) tasks, including text classification, entity recognition, and relation extraction.
- Score: 1.9950441865030422
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Compliance Checking (ACC) within the Architecture, Engineering, and Construction (AEC) sector necessitates automating the interpretation of building regulations to achieve its full potential. Converting textual rules into machine-readable formats is challenging due to the complexities of natural language and the scarcity of resources for advanced Machine Learning (ML). Addressing these challenges, we introduce CODE-ACCORD, a dataset of 862 sentences from the building regulations of England and Finland. Only the self-contained sentences, which express complete rules without needing additional context, were considered as they are essential for ACC. Each sentence was manually annotated with entities and relations by a team of 12 annotators to facilitate machine-readable rule generation, followed by careful curation to ensure accuracy. The final dataset comprises 4,297 entities and 4,329 relations across various categories, serving as a robust ground truth. CODE-ACCORD supports a range of ML and Natural Language Processing (NLP) tasks, including text classification, entity recognition, and relation extraction. It enables applying recent trends, such as deep neural networks and large language models, to ACC.
Related papers
- Text-to-SPARQL Goes Beyond English: Multilingual Question Answering Over Knowledge Graphs through Human-Inspired Reasoning [51.203811759364925]
mKGQAgent breaks down the task of converting natural language questions into SPARQL queries into modular, interpretable subtasks. Evaluated on the DBpedia- and Corporate-based KGQA benchmarks within the Text2SPARQL challenge 2025, our approach took first place among the participants.
arXiv Detail & Related papers (2025-07-22T19:23:03Z)
- KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment [5.295540405828356]
KELPS is an iterative framework for translating, synthesizing, and filtering informal data into formal languages. First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems.
arXiv Detail & Related papers (2025-07-11T15:05:06Z)
- Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation [36.166087396386445]
We present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. It covers 1,159 annotated clauses from 361 regulations across ten categories; each clause is modularly structured with four logical elements (subject, condition, constraint, and contextual information), along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing.
arXiv Detail & Related papers (2025-05-26T10:38:32Z)
- PICASO: Permutation-Invariant Context Composition with State Space Models [98.91198288025117]
State Space Models (SSMs) offer a promising solution by allowing a database of contexts to be mapped onto fixed-dimensional states.
We propose a simple mathematical relation derived from SSM dynamics to compose multiple states into one that efficiently approximates the effect of concatenating raw context tokens.
We evaluate our resulting method on WikiText and MSMARCO in both zero-shot and fine-tuned settings, and show that we can match the strongest performing baseline while enjoying on average 5.4x speedup.
arXiv Detail & Related papers (2025-02-24T19:48:00Z)
- RIRAG: Regulatory Information Retrieval and Answer Generation [51.998738311700095]
We introduce a task of generating question-passage pairs, where questions are automatically created and paired with relevant regulatory passages. We create the ObliQA dataset, containing 27,869 questions derived from the collection of Abu Dhabi Global Markets (ADGM) financial regulation documents. We design a baseline Regulatory Information Retrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a novel evaluation metric.
arXiv Detail & Related papers (2024-09-09T14:44:19Z)
- Using Large Language Models for the Interpretation of Building Regulations [7.013802453969655]
Large language models (LLMs) can generate logically coherent text and source code responding to user prompts.
This paper evaluates the performance of LLMs in translating building regulations into LegalRuleML in a few-shot learning setup.
arXiv Detail & Related papers (2024-07-26T08:30:47Z)
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- A Text Classification-Based Approach for Evaluating and Enhancing the Machine Interpretability of Building Codes [9.730183895717056]
This research proposes a novel approach to automatically evaluate and enhance the machine interpretability of individual clauses and building codes.
Experiments show that the proposed text classification algorithm outperforms the existing CNN- or RNN-based methods.
Analyzing the results of more than 150 building codes in China showed that their average interpretability is 34.40%.
arXiv Detail & Related papers (2023-09-24T11:36:21Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- COLLIE: Systematic Construction of Constrained Text Generation Tasks [33.300039566331876]
COLLIE is a grammar-based framework that allows the specification of rich, compositional constraints with diverse generation levels.
We develop tools for automatic extraction of task instances given a constraint structure and a raw text corpus.
We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performances to reveal shortcomings.
arXiv Detail & Related papers (2023-07-17T17:48:51Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- SPaR.txt, a cheap Shallow Parsing approach for Regulatory texts [6.656036869700669]
This study introduces a shallow parsing task for which training data is relatively cheap to create.
We show through manual evaluation that the model identifies most (89.84%) defined terms in a set of building regulation documents.
arXiv Detail & Related papers (2021-10-04T10:00:22Z)
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection [62.071938098215085]
We focus on the CommonGen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
arXiv Detail & Related papers (2020-12-19T23:23:40Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)