Better STEP, a format and dataset for boundary representation
- URL: http://arxiv.org/abs/2506.05417v1
- Date: Wed, 04 Jun 2025 22:52:07 GMT
- Title: Better STEP, a format and dataset for boundary representation
- Authors: Nafiseh Izadyar, Sai Chandra Madduri, Teseo Schneider,
- Abstract summary: Boundary representation (B-rep) generated from computer-aided design (CAD) is widely used in industry, with several large datasets available.<n>The data in these datasets is represented in STEP format, requiring a CAD kernel to read and process it.<n>This paper introduces an alternative format based on the open, cross-platform format HDF5 and a corresponding dataset for STEP files, paired with an open-source library to query and process them.
- Score: 6.013943959400016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Boundary representation (B-rep) generated from computer-aided design (CAD) is widely used in industry, with several large datasets available. However, the data in these datasets is represented in STEP format, requiring a CAD kernel to read and process it. This dramatically limits their scope and usage in large learning pipelines, as it constrains the possibility of deploying them on computing clusters due to the high cost of per-node licenses. This paper introduces an alternative format based on the open, cross-platform format HDF5 and a corresponding dataset for STEP files, paired with an open-source library to query and process them. Our Python package also provides standard functionalities such as sampling, normals, and curvature to ease integration in existing pipelines. To demonstrate the effectiveness of our format, we converted the Fusion 360 dataset and the ABC dataset. We developed four standard use cases (normal estimation, denoising, surface reconstruction, and segmentation) to assess the integrity of the data and its compliance with the original STEP files.
Related papers
- Text embedding models can be great data engineers [0.0]
We propose ADEPT, an automated data engineering pipeline via text embeddings.<n>We show that ADEPT outperforms the best existing benchmarks in a diverse set of datasets.
arXiv Detail & Related papers (2025-05-20T18:12:19Z) - Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities.<n>It supports more critical tasks including data analysis, annotation, and foundation model post-training.<n>It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z) - Enabling Advanced Land Cover Analytics: An Integrated Data Extraction Pipeline for Predictive Modeling with the Dynamic World Dataset [1.3757956340051605]
We present a flexible and efficient end to end pipeline for working with the Dynamic World dataset.
This includes a pre-processing and representation framework which tackles noise removal, efficient extraction of large amounts of data, and re-representation of LULC data.
To demonstrate the power of our pipeline, we use it to extract data for an urbanization prediction problem and build a suite of machine learning models with excellent performance.
arXiv Detail & Related papers (2024-10-11T16:13:01Z) - Out-of-Core Dimensionality Reduction for Large Data via Out-of-Sample Extensions [8.368145000145594]
Dimensionality reduction (DR) is a well-established approach for the visualization of high-dimensional data sets.
We propose the use of out-of-sample extensions to perform DR on large data sets.
We provide an evaluation of the projection quality of five common DR algorithms.
arXiv Detail & Related papers (2024-08-07T23:30:53Z) - CC-GPX: Extracting High-Quality Annotated Geospatial Data from Common Crawl [0.07499722271664144]
The Common Crawl (CC) corpus is the largest open web crawl dataset containing 9.5+ petabytes of data captured since 2008.
In this paper, we introduce an efficient pipeline to extract annotated user-generated tracks from GPX files found in CC.
The resulting multimodal dataset includes 1,416 pairings of human-written descriptions and MultiLineString vector data from the 6 most recent CC releases.
arXiv Detail & Related papers (2024-05-17T18:31:26Z) - Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z) - In-depth Analysis On Parallel Processing Patterns for High-Performance
Dataframes [0.0]
We present a set of parallel processing patterns for distributed dataframe operators and the reference runtime implementation, Cylon.
In this paper, we are expanding on the initial concept by introducing a cost model for evaluating the said patterns.
We evaluate the performance of Cylon on the ORNL Summit supercomputer.
arXiv Detail & Related papers (2023-07-03T23:11:03Z) - A Multi-Format Transfer Learning Model for Event Argument Extraction via
Variational Information Bottleneck [68.61583160269664]
Event argument extraction (EAE) aims to extract arguments with given roles from texts.
We propose a multi-format transfer learning model with variational information bottleneck.
We conduct extensive experiments on three benchmark datasets, and obtain new state-of-the-art performance on EAE.
arXiv Detail & Related papers (2022-08-27T13:52:01Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision
Datasets from 3D Scans [103.92680099373567]
This paper introduces a pipeline to parametrically sample and render multi-task vision datasets from comprehensive 3D scans from the real world.
Changing the sampling parameters allows one to "steer" the generated datasets to emphasize specific information.
Common architectures trained on a generated starter dataset reached state-of-the-art performance on multiple common vision tasks and benchmarks.
arXiv Detail & Related papers (2021-10-11T04:21:46Z) - Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.