OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring
- URL: http://arxiv.org/abs/2508.11482v1
- Date: Fri, 15 Aug 2025 13:56:21 GMT
- Title: OpenConstruction: A Systematic Synthesis of Open Visual Datasets for Data-Centric Artificial Intelligence in Construction Monitoring
- Authors: Ruoxin Xiong, Yanyu Wang, Jiannan Cai, Kaijian Liu, Yuansheng Zhu, Pingbo Tang, Nora El-Gohary
- Abstract summary: The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. Despite growing interest in visual datasets, existing resources vary widely in size, quality, and representativeness of real-world construction conditions. This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development.
- Score: 4.795391174842949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The construction industry increasingly relies on visual data to support Artificial Intelligence (AI) and Machine Learning (ML) applications for site monitoring. High-quality, domain-specific datasets, comprising images, videos, and point clouds, capture site geometry and spatiotemporal dynamics, including the location and interaction of objects, workers, and materials. However, despite growing interest in leveraging visual datasets, existing resources vary widely in sizes, data modalities, annotation quality, and representativeness of real-world construction conditions. A systematic review to categorize their data characteristics and application contexts is still lacking, limiting the community's ability to fully understand the dataset landscape, identify critical gaps, and guide future directions toward more effective, reliable, and scalable AI applications in construction. To address this gap, this study conducts an extensive search of academic databases and open-data platforms, yielding 51 publicly available visual datasets that span the 2005-2024 period. These datasets are categorized using a structured data schema covering (i) data fundamentals (e.g., size and license), (ii) data modalities (e.g., RGB and point cloud), (iii) annotation frameworks (e.g., bounding boxes), and (iv) downstream application domains (e.g., progress tracking). This study synthesizes these findings into an open-source catalog, OpenConstruction, supporting data-driven method development. Furthermore, the study discusses several critical limitations in the existing construction dataset landscape and presents a roadmap for future data infrastructure anchored in the Findability, Accessibility, Interoperability, and Reusability (FAIR) principles. By reviewing the current landscape and outlining strategic priorities, this study supports the advancement of data-centric solutions in the construction sector.
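The four-part schema named in the abstract can be pictured as a simple record type. The following is only a minimal sketch under stated assumptions: the class name `DatasetRecord`, its field names, the helper `filter_by`, and the two sample entries are hypothetical illustrations, not the structure or contents of the actual OpenConstruction catalog.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """One catalog entry; every field name here is a hypothetical illustration."""
    # (i) data fundamentals
    name: str
    year: int
    size: int  # number of samples (images, clips, or scans)
    license: str
    # (ii) data modalities, e.g. "RGB" or "point cloud"
    modalities: list[str] = field(default_factory=list)
    # (iii) annotation frameworks, e.g. "bounding boxes"
    annotations: list[str] = field(default_factory=list)
    # (iv) downstream application domains, e.g. "progress tracking"
    applications: list[str] = field(default_factory=list)


def filter_by(records, modality=None, application=None):
    """Return the records that match the requested modality and/or application."""
    return [
        r for r in records
        if (modality is None or modality in r.modalities)
        and (application is None or application in r.applications)
    ]


# Placeholder entries, not real datasets from the catalog.
catalog = [
    DatasetRecord("example-a", 2019, 5000, "CC BY 4.0",
                  ["RGB"], ["bounding boxes"], ["site monitoring"]),
    DatasetRecord("example-b", 2022, 300, "unspecified",
                  ["point cloud"], ["semantic labels"], ["progress tracking"]),
]

matches = filter_by(catalog, modality="point cloud", application="progress tracking")
print([r.name for r in matches])
```

Keeping the four schema groups as separate fields makes it straightforward to query a catalog by modality or application, which is the kind of lookup a structured catalog is meant to support.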
Related papers
- CoMa: Contextual Massing Generation with Vision-Language Models [7.943264761730892]
We propose an automated framework for generating building massing based on functional requirements and site context.
A primary obstacle to such data-driven methods has been the lack of suitable datasets.
We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models.
arXiv Detail & Related papers (2026-01-13T11:44:00Z)
- OpenDataArena: A Fair and Open Arena for Benchmarking Post-Training Dataset Value [74.80873109856563]
OpenDataArena (ODA) is a holistic and open platform designed to benchmark the intrinsic value of post-training data.
ODA establishes a comprehensive ecosystem comprising four key pillars: (i) a unified training-evaluation pipeline that ensures fair, open comparisons across diverse models; (ii) a multi-dimensional scoring framework that profiles data quality along tens of distinct axes; and (iii) an interactive data lineage explorer to visualize dataset genealogy and dissect component sources.
arXiv Detail & Related papers (2025-12-16T03:33:24Z)
- From Parameters to Performance: A Data-Driven Study on LLM Structure and Development [73.67759647072519]
Large language models (LLMs) have achieved remarkable success across various domains.
Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce.
We present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks.
arXiv Detail & Related papers (2025-09-14T12:20:39Z)
- Graph Foundation Models: A Comprehensive Survey [66.74249119139661]
Graph Foundation Models (GFMs) aim to bring scalable, general-purpose intelligence to structured data.
This survey provides a comprehensive overview of GFMs, unifying diverse efforts under a modular framework.
GFMs are poised to become foundational infrastructure for open-ended reasoning over structured data.
arXiv Detail & Related papers (2025-05-21T05:08:00Z)
- OpenFACADES: An Open Framework for Architectural Caption and Attribute Data Enrichment via Street View Imagery [4.33299613844962]
Building properties play a crucial role in spatial data infrastructures, supporting applications such as energy simulation, risk assessment, and environmental modeling.
Recent advances have enabled the extraction and tagging of objective building attributes using remote sensing and street-level imagery.
This study bridges the gaps by introducing OpenFACADES, an open framework that leverages crowdsourced data to enrich building profiles.
arXiv Detail & Related papers (2025-04-01T08:20:13Z)
- Investigating Public Fine-Tuning Datasets: A Complex Review of Current Practices from a Construction Perspective [2.12587313410587]
This paper reviews current public fine-tuning datasets from the perspective of data construction.
The review gives an overview of public fine-tuning datasets from two angles: evolution and taxonomy.
arXiv Detail & Related papers (2024-07-11T13:11:16Z)
- The GOOSE Dataset for Perception in Unstructured Environments [3.0408645115035036]
We present a comprehensive dataset specifically designed for unstructured outdoor environments.
The GOOSE dataset incorporates 10 000 labeled pairs of images and point clouds, which are utilized to train a range of state-of-the-art segmentation models.
This initiative aims to establish a common framework, enabling the seamless inclusion of existing datasets and a fast way to enhance the perception capabilities of various robots operating in unstructured environments.
arXiv Detail & Related papers (2023-10-25T17:20:38Z)
- Data Mesh: a Systematic Gray Literature Review [3.038477115588261]
Data mesh is an emerging domain-driven decentralized data architecture that aims to minimize or avoid operational bottlenecks.
We systematically collected, analyzed, and synthesized 114 industrial gray literature articles.
The review provides insights into practitioners' perspectives on the four key principles of data mesh.
arXiv Detail & Related papers (2023-04-03T15:16:46Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - CateCom: a practical data-centric approach to categorization of
computational models [77.34726150561087]
We present an effort aimed at organizing the landscape of physics-based and data-driven computational models.
We apply object-oriented design concepts and outline the foundations of an open-source collaborative framework.
arXiv Detail & Related papers (2021-09-28T02:59:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.