Related papers: Extracting Structured Requirements from Unstructured Building Technical Specifications for Building Information Modeling

URL: http://arxiv.org/abs/2508.13833v1
Date: Tue, 19 Aug 2025 13:55:41 GMT
Title: Extracting Structured Requirements from Unstructured Building Technical Specifications for Building Information Modeling
Authors: Insaf Nahri, Romain Pinquié, Philippe Véron, Nicolas Bus, Mathieu Thorel
Abstract summary: This study explores the integration of Building Information Modeling with Natural Language Processing (NLP). It aims to automate the extraction of requirements from unstructured French Building Technical Specification documents within the construction industry. The results indicate that CamemBERT and Fr_core_news_lg exhibited superior performance in NER, achieving F1-scores over 90%, while Random Forest proved most effective in RE, with an F1-score above 80%.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: This study explores the integration of Building Information Modeling (BIM) with Natural Language Processing (NLP) to automate the extraction of requirements from unstructured French Building Technical Specification (BTS) documents within the construction industry. Employing Named Entity Recognition (NER) and Relation Extraction (RE) techniques, the study leverages the transformer-based model CamemBERT and applies transfer learning with the French language model Fr_core_news_lg, both pre-trained on a large French corpus in the general domain. To benchmark these models, additional approaches ranging from rule-based to deep learning-based methods are developed. For RE, four different supervised models, including Random Forest, are implemented using a custom feature vector. A hand-crafted annotated dataset is used to compare the effectiveness of NER approaches and RE models. Results indicate that CamemBERT and Fr_core_news_lg exhibited superior performance in NER, achieving F1-scores over 90%, while Random Forest proved most effective in RE, with an F1-score above 80%. The outcomes are intended to be represented as a knowledge graph in future work to further enhance automatic verification systems.
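
Illustration (not from the paper): the F1-scores quoted in the abstract combine precision and recall. The sketch below shows one common way to compute an entity-level F1, assuming exact-match scoring of (start, end, label) spans. The abstract does not state the paper's evaluation protocol, so the function name `entity_f1`, the label names, and the sample spans are all hypothetical.

```python
def entity_f1(gold, predicted):
    """Return (precision, recall, f1) for exact-match entity spans.

    Each entity is a (start, end, label) tuple. A prediction counts as
    correct only if all three fields match a gold entity (an assumption;
    other matching schemes exist).
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one sentence of a specification.
gold = [(0, 2, "ELEMENT"), (5, 7, "PROPERTY"), (9, 10, "VALUE")]
pred = [(0, 2, "ELEMENT"), (5, 7, "VALUE"), (9, 10, "VALUE")]

precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case two of the three predicted spans match the gold annotation exactly, so precision, recall and F1 are each 2/3.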

Related papers

Kastor: Fine-tuned Small Language Models for Shape-based Active Relation Extraction [0.0]
RDF pattern-based extraction is a compelling approach for fine-tuning small language models. We introduce Kastor, a framework that advances this approach to meet the demands for completing and refining knowledge bases.
arXiv Detail & Related papers (2025-11-05T13:43:47Z)
ORIGAMI: A generative transformer architecture for predictions from semi-structured data [3.5639148953570836]
ORIGAMI is a transformer-based architecture that processes nested key/value pairs. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks.
arXiv Detail & Related papers (2024-12-23T07:21:17Z)
High-Performance Few-Shot Segmentation with Foundation Models: An Empirical Study [64.06777376676513]
We develop a few-shot segmentation (FSS) framework based on foundation models. To be specific, we propose a simple approach to extract implicit knowledge from foundation models to construct coarse correspondence. Experiments on two widely used datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-10T08:04:11Z)
Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction. We reformulate the task to be entity-centric, enabling the use of diverse metrics. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We conduct empirical studies of state-of-the-art UniDA methods using foundation models. We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models. Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z)
Federated Learning Aggregation: New Robust Algorithms with Guarantees [63.96013144017572]
Federated learning has been recently proposed for distributed model training at the edge. This paper presents a complete general mathematical convergence analysis to evaluate aggregation strategies in a federated learning framework. We derive novel aggregation algorithms which are able to modify their model architecture by differentiating client contributions according to the value of their losses.
arXiv Detail & Related papers (2022-05-22T16:37:53Z)
Leveraging Advantages of Interactive and Non-Interactive Models for Vector-Based Cross-Lingual Information Retrieval [12.514666775853598]
We propose a novel framework to leverage the advantages of interactive and non-interactive models. We introduce a semi-interactive mechanism, which builds our model upon a non-interactive architecture but encodes each document together with its associated multilingual queries. Our methods significantly boost the retrieval accuracy while maintaining the computational efficiency.
arXiv Detail & Related papers (2021-11-03T03:03:19Z)
AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures. We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS. Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings. We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data. We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
Application of Pre-training Models in Named Entity Recognition [5.285449619478964]
We introduce the architecture and pre-training tasks of four common pre-training models: BERT, ERNIE, ERNIE2.0-tiny, and RoBERTa. We apply these pre-training models to a NER task by fine-tuning, and compare the effects of the different model architectures and pre-training tasks on the NER task. Experimental results showed that RoBERTa achieved state-of-the-art results on the MSRA-2006 dataset.
arXiv Detail & Related papers (2020-02-09T08:18:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.