CODE-ACCORD: A Corpus of building regulatory data for rule generation towards automatic compliance checking
- URL: http://arxiv.org/abs/2403.02231v4
- Date: Tue, 18 Feb 2025 11:00:19 GMT
- Title: CODE-ACCORD: A Corpus of building regulatory data for rule generation towards automatic compliance checking
- Authors: Hansi Hettiarachchi, Amna Dridi, Mohamed Medhat Gaber, Pouyan Parsafard, Nicoleta Bocaneala, Katja Breitenfelder, Gonçal Costa, Maria Hedblom, Mihaela Juganaru-Mathieu, Thamer Mecharnia, Sumee Park, He Tan, Abdel-Rahman H. Tawil, Edlira Vakaj
- Abstract summary: CODE-ACCORD is a dataset of 862 sentences from the building regulations of England and Finland.
It supports a range of ML and Natural Language Processing (NLP) tasks, including text classification, entity recognition, and relation extraction.
- Score: 1.9950441865030422
- License:
- Abstract: Automatic Compliance Checking (ACC) within the Architecture, Engineering, and Construction (AEC) sector necessitates automating the interpretation of building regulations to achieve its full potential. Converting textual rules into machine-readable formats is challenging due to the complexities of natural language and the scarcity of resources for advanced Machine Learning (ML). Addressing these challenges, we introduce CODE-ACCORD, a dataset of 862 sentences from the building regulations of England and Finland. Only the self-contained sentences, which express complete rules without needing additional context, were considered as they are essential for ACC. Each sentence was manually annotated with entities and relations by a team of 12 annotators to facilitate machine-readable rule generation, followed by careful curation to ensure accuracy. The final dataset comprises 4,297 entities and 4,329 relations across various categories, serving as a robust ground truth. CODE-ACCORD supports a range of ML and Natural Language Processing (NLP) tasks, including text classification, entity recognition, and relation extraction. It enables applying recent trends, such as deep neural networks and large language models, to ACC.
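As a rough illustration of the annotation structure described in the abstract, the sketch below shows how a single annotated sentence with entities and relations might be represented for entity recognition and relation extraction. The class names, label strings, character offsets, and the example sentence are illustrative assumptions, not CODE-ACCORD's actual schema or data.

```python
# Hypothetical record layout for one annotated regulatory sentence.
# All field names, labels, and the example itself are assumptions for
# illustration; they do not reflect CODE-ACCORD's actual schema.
from dataclasses import dataclass

@dataclass
class Entity:
    text: str    # entity surface form
    label: str   # entity category, e.g. an object, property, or value
    start: int   # character offset (inclusive)
    end: int     # character offset (exclusive)

@dataclass
class Relation:
    head: int    # index of the head entity in `entities`
    tail: int    # index of the tail entity in `entities`
    label: str   # relation category, e.g. a constraint between entities

@dataclass
class AnnotatedSentence:
    sentence: str
    entities: list[Entity]
    relations: list[Relation]

# An invented self-contained rule of the kind the dataset targets.
example = AnnotatedSentence(
    sentence="The stair riser height shall not exceed 220 mm.",
    entities=[
        Entity("stair riser height", "object-property", 4, 22),
        Entity("220 mm", "value", 40, 46),
    ],
    relations=[Relation(head=0, tail=1, label="max-value")],
)

# Such records can feed text classification (is the sentence a rule?),
# entity recognition (span labelling), and relation extraction.
print(example.entities[0].text, "--", example.relations[0].label, "-->",
      example.entities[1].text)
```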
Related papers
- Using Large Language Models for the Interpretation of Building Regulations [7.013802453969655]
Large language models (LLMs) can generate logically coherent text and source code in response to user prompts.
This paper evaluates the performance of LLMs in translating building regulations into LegalRuleML in a few-shot learning setup (see the prompt-assembly sketch after this list).
arXiv Detail & Related papers (2024-07-26T08:30:47Z)
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present the SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- A Text Classification-Based Approach for Evaluating and Enhancing the Machine Interpretability of Building Codes [9.730183895717056]
This research aims to propose a novel approach to automatically evaluate and enhance the machine interpretability of single clauses and building codes.
Experiments show that the proposed text classification algorithm outperforms the existing CNN- or RNN-based methods.
Analyzing more than 150 building codes in China showed that their average interpretability is 34.40%.
arXiv Detail & Related papers (2023-09-24T11:36:21Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- COLLIE: Systematic Construction of Constrained Text Generation Tasks [33.300039566331876]
COLLIE is a grammar-based framework that allows the specification of rich, compositional constraints with diverse generation levels.
We develop tools for automatic extraction of task instances given a constraint structure and a raw text corpus.
We perform systematic experiments across five state-of-the-art instruction-tuned language models and analyze their performances to reveal shortcomings.
arXiv Detail & Related papers (2023-07-17T17:48:51Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic context-free grammars (CFGs) that produce hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- Controlled Text Generation with Natural Language Instructions [74.88938055638636]
InstructCTG is a controlled text generation framework that incorporates different constraints.
We first extract the underlying constraints of natural texts through a combination of off-the-shelf NLP tools and simple verbalizers.
By prepending natural language descriptions of the constraints and a few demonstrations, we fine-tune a pre-trained language model to incorporate various types of constraints.
arXiv Detail & Related papers (2023-04-27T15:56:34Z)
- SPaR.txt, a cheap Shallow Parsing approach for Regulatory texts [6.656036869700669]
This study introduces a shallow parsing task for which training data is relatively cheap to create.
We show through manual evaluation that the model identifies most (89.84%) defined terms in a set of building regulation documents.
arXiv Detail & Related papers (2021-10-04T10:00:22Z)
- Lexically-constrained Text Generation through Commonsense Knowledge Extraction and Injection [62.071938098215085]
We focus on the CommonGen benchmark, wherein the aim is to generate a plausible sentence for a given set of input concepts.
We propose strategies for enhancing the semantic correctness of the generated text.
arXiv Detail & Related papers (2020-12-19T23:23:40Z)
- Automatic Extraction of Rules Governing Morphological Agreement [103.78033184221373]
We develop an automated framework for extracting a first-pass grammatical specification from raw text.
We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world's languages.
We apply our framework to all languages included in the Universal Dependencies project, with promising results.
arXiv Detail & Related papers (2020-10-02T18:31:45Z)
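For the few-shot setup mentioned in "Using Large Language Models for the Interpretation of Building Regulations" above, a minimal prompt-assembly sketch might look as follows. The exemplar pairs, the query sentence, and the elided LegalRuleML snippets are assumptions for illustration, not the paper's actual prompts or data.

```python
# Minimal sketch of a few-shot prompt for translating building regulations
# into LegalRuleML. The exemplars and the query are invented; the LegalRuleML
# bodies are left elided rather than guessed.
FEW_SHOT_EXEMPLARS = [
    # (natural-language rule, hand-written LegalRuleML translation)
    ("Escape routes shall be at least 1200 mm wide.",
     "<lrml:PrescriptiveStatement>...</lrml:PrescriptiveStatement>"),
    ("Each dwelling shall have a smoke alarm on every storey.",
     "<lrml:PrescriptiveStatement>...</lrml:PrescriptiveStatement>"),
]

def build_prompt(regulation: str) -> str:
    """Assemble the prompt: task instruction, worked examples, then the query."""
    parts = ["Translate each building regulation into LegalRuleML."]
    for text, lrml in FEW_SHOT_EXEMPLARS:
        parts.append(f"Regulation: {text}\nLegalRuleML: {lrml}")
    parts.append(f"Regulation: {regulation}\nLegalRuleML:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(build_prompt("The stair riser height shall not exceed 220 mm."))
```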