Scalability in Building Component Data Annotation: Enhancing Facade Material Classification with Synthetic Data
- URL: http://arxiv.org/abs/2404.08557v1
- Date: Fri, 12 Apr 2024 15:54:48 GMT
- Title: Scalability in Building Component Data Annotation: Enhancing Facade Material Classification with Synthetic Data
- Authors: Josie Harrison, Alexander Hollberg, Yinan Yu
- Abstract summary: Computer vision models trained on Google Street View images can create material cadastres.
Current approaches need manually annotated datasets that are difficult to obtain and often have class imbalance.
This paper fine-tuned a Swin Transformer model on a synthetic dataset generated with DALL-E and compared the performance to a similar manually annotated dataset.
- Score: 45.981332942020856
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer vision models trained on Google Street View images can create material cadastres. However, current approaches need manually annotated datasets that are difficult to obtain and often have class imbalance. To address these challenges, this paper fine-tuned a Swin Transformer model on a synthetic dataset generated with DALL-E and compared the performance to a similar manually annotated dataset. Although manual annotation remains the gold standard, the synthetic dataset performance demonstrates a reasonable alternative. The findings will ease annotation needed to develop material cadastres, offering architects insights into opportunities for material reuse, thus contributing to the reduction of demolition waste.
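The abstract names class imbalance in manually annotated datasets as one motivation for synthetic data. As a minimal sketch of that idea, and not the authors' pipeline, the snippet below counts manual labels per class and works out how many synthetic images each class would need to match the largest class. The function name `synthetic_quota`, the material class names, and the counts are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Return, per class, how many synthetic images are needed to reach `target`.

    `labels` is an iterable of class names from a manually annotated set.
    If `target` is None, the size of the largest class is used.
    """
    counts = Counter(labels)
    if not counts:
        return {}
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target need no synthetic images.
    return {cls: max(target - n, 0) for cls, n in sorted(counts.items())}


# Hypothetical facade-material labels (illustrative only, not from the paper).
manual_labels = ["brick"] * 5 + ["concrete"] * 3 + ["wood"] * 1
quota = synthetic_quota(manual_labels)
print(quota)
```

Under these assumptions, the resulting quota could then drive how many images are requested per material class from an image generator; how the paper actually sized its synthetic dataset is not stated in the abstract.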
Related papers
- Understanding Synthetic Context Extension via Retrieval Heads [51.8869530817334]
We investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning.
We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted.
Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world capabilities over long contexts.
arXiv Detail & Related papers (2024-10-29T17:55:00Z)
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to the scarcity of high-quality data for training large language models (LLMs).
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
- JOBSKAPE: A Framework for Generating Synthetic Job Postings to Enhance Skill Matching [18.94748873243611]
JobSkape is a framework to generate synthetic data for skill-to-taxonomy matching.
Within this framework, we create SkillSkape, a comprehensive open-source synthetic dataset of job postings.
We present a multi-step pipeline for skill extraction and matching tasks using large language models.
arXiv Detail & Related papers (2024-02-05T17:57:26Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Effective Few-Shot Named Entity Linking by Meta-Learning [34.70028855572534]
We propose a novel weak supervision strategy to generate non-trivial synthetic entity-mention pairs.
We also design a meta-learning mechanism to assign different weights to each synthetic entity-mention pair automatically.
Experiments on real-world datasets show that the proposed method can extensively improve the state-of-the-art few-shot entity linking model.
arXiv Detail & Related papers (2022-07-12T03:23:02Z)
- Unsupervised Opinion Summarization with Content Planning [58.5308638148329]
We show that explicitly incorporating content planning in a summarization model yields output of higher quality.
We also create synthetic datasets which are more natural, resembling real world document-summary pairs.
Our approach outperforms competitive models in generating informative, coherent, and fluent summaries.
arXiv Detail & Related papers (2020-12-14T18:41:58Z)
- Assembling Semantically-Disentangled Representations for Predictive-Generative Models via Adaptation from Synthetic Domain [32.42156485883356]
We show that semantically-aligned representations can be generated with the help of a physics based engine.
It is shown that the proposed (SYNTH-VAE-GAN) method can construct a conditional-generative model of human face attributes without relying on real data labels.
arXiv Detail & Related papers (2020-02-23T03:35:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.