A Machine Learning Approach to Classifying Construction Cost Documents into the International Construction Measurement Standard
- URL: http://arxiv.org/abs/2211.07705v1
- Date: Mon, 24 Oct 2022 11:35:53 GMT
- Title: A Machine Learning Approach to Classifying Construction Cost Documents into the International Construction Measurement Standard
- Authors: J. Ignacio Deza, Hisham Ihshaish and Lamine Mahdjoubi
- Abstract summary: We introduce the first automated models for classifying natural language descriptions provided in cost documents called "Bills of Quantities" (BoQs) into the International Construction Measurement Standard (ICMS).
We learn from a dataset of more than 50 thousand descriptions of items retrieved from 24 large infrastructure construction projects across the United Kingdom.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the first automated models for classifying natural language descriptions provided in cost documents called "Bills of Quantities" (BoQs), popular in the infrastructure construction industry, into the International Construction Measurement Standard (ICMS). The models we deployed and systematically evaluated for multi-class text classification are learnt from a dataset of more than 50 thousand descriptions of items retrieved from 24 large infrastructure construction projects across the United Kingdom. We describe our approach to language representation and subsequent modelling to examine the strength of contextual semantics and temporal dependency of the language used in construction project documentation. To do so, we evaluate two experimental pipelines for inferring ICMS codes from text, based on two different language representation models and a range of state-of-the-art sequence-based classification methods, including recurrent and convolutional neural network architectures. The findings indicate that a highly effective and accurate ICMS automation model is within reach, with reported results above 90% average F1 score across 32 ICMS categories. Furthermore, because language use in BoQ text is short, largely descriptive and technical, we find that simpler models compare favourably, achieving higher accuracy. Our analysis suggests that information is more likely embedded in local key features of the descriptive text, which explains why a simpler generic temporal convolutional network (TCN) exhibits memory comparable to recurrent architectures of the same capacity, and consequently outperforms them at this task.
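To make the modelling described in the abstract concrete, below is a minimal sketch of a TCN-style classifier for short BoQ item descriptions, written in PyTorch. This is an illustration only: the vocabulary size, tokenisation, layer widths and all other hyperparameters are assumptions, not the authors' published configuration.

```python
# A minimal sketch of a temporal convolutional network (TCN) text
# classifier, assuming pre-tokenised, integer-encoded BoQ descriptions.
# All names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 32      # the paper reports 32 ICMS categories
VOCAB_SIZE = 20_000   # assumed vocabulary size
EMBED_DIM = 128       # assumed embedding width
MAX_LEN = 40          # BoQ items are short, largely technical phrases

class TCNClassifier(nn.Module):
    def __init__(self, channels=(128, 128, 128), kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM, padding_idx=0)
        layers, in_ch = [], EMBED_DIM
        # Stacked dilated 1-D convolutions: dilation doubles per layer,
        # so the receptive field grows exponentially with depth while
        # the model stays feed-forward (no recurrence).
        for i, out_ch in enumerate(channels):
            dilation = 2 ** i
            layers += [
                nn.Conv1d(in_ch, out_ch, kernel_size,
                          padding=(kernel_size - 1) * dilation,
                          dilation=dilation),
                nn.ReLU(),
            ]
            in_ch = out_ch
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Linear(in_ch, NUM_CLASSES)

    def forward(self, token_ids):                   # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)   # (batch, embed, seq)
        h = self.tcn(x)
        # Global max pooling picks up local key features wherever they
        # occur in the description, matching the abstract's observation
        # that the signal is embedded in local cues.
        pooled = h.max(dim=2).values
        return self.head(pooled)                    # class logits

model = TCNClassifier()
dummy = torch.randint(1, VOCAB_SIZE, (8, MAX_LEN))  # a batch of 8 items
logits = model(dummy)                               # shape (8, 32)
```

The design choice worth noting is that a TCN of this shape has no recurrent state at all, yet its dilated receptive field covers an entire short BoQ description, which is one plausible reading of why the paper finds a generic TCN matching the effective memory of recurrent architectures of the same capacity on this task.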
Related papers
- Neural Architecture Search for Sentence Classification with BERT [4.862490782515929]
We perform an AutoML search to find architectures that outperform the current single layer at only a small compute cost.
We validate our classification architecture on a variety of NLP benchmarks from the GLUE dataset.
arXiv Detail & Related papers (2024-03-27T13:25:43Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts)
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Unifying Structure and Language Semantic for Efficient Contrastive
Knowledge Graph Completion with Structured Entity Anchors [0.3913403111891026]
The goal of knowledge graph completion (KGC) is to predict missing links in a KG using trained facts that are already known.
We propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning.
arXiv Detail & Related papers (2023-11-07T11:17:55Z) - LLM2KB: Constructing Knowledge Bases using instruction tuned context
aware Large Language Models [0.8702432681310401]
Our paper proposes LLM2KB, a system for constructing knowledge bases using large language models.
Our best performing model achieved an average F1 score of 0.6185 across 21 relations in the LM-KBC challenge held at the ISWC 2023 conference.
arXiv Detail & Related papers (2023-08-25T07:04:16Z) - Pre-Training to Learn in Context [138.0745138788142]
The ability of in-context learning is not fully exploited because language models are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z) - Unified Text Structuralization with Instruction-tuned Language Models [28.869098023025753]
We propose a simple and efficient approach to instruct large language model (LLM) to extract a variety of structures from texts.
Experiments show that this approach can enable language models to perform comparable with other state-of-the-art methods on datasets of a variety of languages and knowledge.
arXiv Detail & Related papers (2023-03-27T07:39:05Z) - DeepStruct: Pretraining of Language Models for Structure Prediction [64.84144849119554]
We pretrain language models on a collection of task-agnostic corpora to generate structures from text.
Our structure pretraining enables zero-shot transfer of the learned knowledge that models have about the structure tasks.
We show that a 10B parameter language model transfers non-trivially to most tasks and obtains state-of-the-art performance on 21 of 28 datasets.
arXiv Detail & Related papers (2022-05-21T00:58:22Z) - CUGE: A Chinese Language Understanding and Generation Evaluation
Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z) - Leveraging Advantages of Interactive and Non-Interactive Models for
Vector-Based Cross-Lingual Information Retrieval [12.514666775853598]
We propose a novel framework to leverage the advantages of interactive and non-interactive models.
We introduce semi-interactive mechanism, which builds our model upon non-interactive architecture but encodes each document together with its associated multilingual queries.
Our methods significantly boost the retrieval accuracy while maintaining the computational efficiency.
arXiv Detail & Related papers (2021-11-03T03:03:19Z) - Minimally-Supervised Structure-Rich Text Categorization via Learning on
Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z) - Towards Making the Most of Context in Neural Machine Translation [112.9845226123306]
We argue that previous research did not make a clear use of the global context.
We propose a new document-level NMT framework that deliberately models the local context of each sentence.
arXiv Detail & Related papers (2020-02-19T03:30:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.