Wild SBOMs: a Large-scale Dataset of Software Bills of Materials from Public Code
- URL: http://arxiv.org/abs/2503.15021v1
- Date: Wed, 19 Mar 2025 09:20:28 GMT
- Title: Wild SBOMs: a Large-scale Dataset of Software Bills of Materials from Public Code
- Authors: Luís Soeiro, Thomas Robert, Stefano Zacchiroli
- Abstract summary: Developers gain productivity by reusing readily available Free and Open Source Software (FOSS) components. One approach to handle those difficulties is to use Software Bills of Materials (SBOMs). A large-scale study on SBOM practices based on SBOM files produced in the wild is still lacking.
- Score: 4.1920378271058425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developers gain productivity by reusing readily available Free and Open Source Software (FOSS) components. Such practices also bring some difficulties, such as managing licensing, components and related security. One approach to handle those difficulties is to use Software Bills of Materials (SBOMs). While there have been studies on the readiness of practitioners to embrace SBOMs and on the SBOM tools ecosystem, a large-scale study on SBOM practices based on SBOM files produced in the wild is still lacking. A starting point for such a study is a large dataset of SBOM files found in the wild. We introduce such a dataset, consisting of over 78 thousand unique SBOM files, deduplicated from those found in over 94 million repositories. We include metadata that contains the standard and format used, the quality score generated by the tool sbomqs, the number of revisions, filenames and provenance information. Finally, we give suggestions and examples of research that could bring new insights on assessing and improving real-world SBOM practices.
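The abstract mentions deduplicating SBOM files and recording the standard each one uses. The paper's actual pipeline is not described on this page, so the following is only a minimal sketch under stated assumptions: files are deduplicated by the SHA-256 of their content, and the standard is guessed from top-level JSON keys (`bomFormat` for CycloneDX, `spdxVersion` for SPDX). The function names and sample paths are hypothetical.

```python
import hashlib
import json


def detect_standard(text: str) -> str:
    """Guess the SBOM standard of a JSON document from its top-level keys."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return "unknown"
    if not isinstance(doc, dict):
        return "unknown"
    # CycloneDX JSON documents carry "bomFormat": "CycloneDX".
    if doc.get("bomFormat") == "CycloneDX":
        return "CycloneDX"
    # SPDX JSON documents carry an "spdxVersion" field such as "SPDX-2.3".
    if "spdxVersion" in doc:
        return "SPDX"
    return "unknown"


def deduplicate(files: dict[str, str]) -> dict[str, dict]:
    """Group files by content hash, keeping every path seen for each hash."""
    unique: dict[str, dict] = {}
    for path, text in files.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        entry = unique.setdefault(
            digest, {"standard": detect_standard(text), "paths": []}
        )
        entry["paths"].append(path)
    return unique


# Hypothetical sample: two identical CycloneDX files and one SPDX file.
sample = {
    "repo-a/bom.json": '{"bomFormat": "CycloneDX", "specVersion": "1.5"}',
    "repo-b/bom.json": '{"bomFormat": "CycloneDX", "specVersion": "1.5"}',
    "repo-c/sbom.spdx.json": '{"spdxVersion": "SPDX-2.3", "name": "demo"}',
}
result = deduplicate(sample)
```

Keeping the list of paths per hash preserves provenance for every duplicate, which is one plausible way to retain the filename and origin metadata the abstract refers to.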
Related papers
- A Dataset of Software Bill of Materials for Evaluating SBOM Consumption Tools [6.081142345739704]
A Software Bill of Materials (SBOM) is a list of components used in software.
Numerous tools support software dependency management through SBOMs.
There is no publicly available dataset specifically designed for this purpose.
We present a dataset of SBOMs generated from real-world Java projects.
arXiv Detail & Related papers (2025-04-09T13:35:02Z)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples.
Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments.
We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z)
- Augmenting Software Bills of Materials with Software Vulnerability Description: A Preliminary Study on GitHub [8.727176816793179]
This paper reports the results of a preliminary study in which we augmented SBOMs of 40 open-source projects with information about Common Vulnerabilities and Exposures.
Our augmented SBOMs have been evaluated by submitting pull requests and by asking project owners to answer a survey.
Although, in most cases, augmented SBOMs were not directly accepted because owners required a continuous SBOM update, the received feedback shows the usefulness of the suggested SBOM augmentation.
arXiv Detail & Related papers (2025-03-18T08:04:22Z)
- Leveraging Retrieval Augmented Generative LLMs For Automated Metadata Description Generation to Enhance Data Catalogs [1.1957520154275776]
Data catalogs serve as repositories for organizing and accessing diverse collections of data assets.
Many data catalogs within organizations suffer from limited searchability due to inadequate metadata like asset descriptions.
This paper explores the challenges associated with metadata creation and proposes a unique prompt enrichment idea of leveraging existing metadata content.
arXiv Detail & Related papers (2025-03-12T02:33:33Z)
- SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Language Models (LLMs) are trained on extensive datasets that include code repositories.
Evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation.
We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z)
- Software Bills of Materials in Maven Central [9.699225997570384]
There is little knowledge about how developers distribute Software Bills of Materials (SBOMs).
We mine SBOMs from Maven Central to assess the extent to which developers publish SBOMs along with the artifacts.
We present our methodology to mine SBOMs, as well as novel insights about SBOM publication.
arXiv Detail & Related papers (2025-01-23T16:56:40Z)
- SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.
SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.
We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving state-of-the-art performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
- MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models [66.64809260956312]
We propose a multi-granularity tool-use benchmark for large language models called MTU-Bench.
Our MTU-Bench is collected by transforming existing high-quality datasets to simulate real-world tool usage scenarios.
Comprehensive experimental results demonstrate the effectiveness of our MTU-Bench.
arXiv Detail & Related papers (2024-10-15T15:46:17Z)
- InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning [58.7966588457529]
InfiMM-WebMath-40B is a high-quality dataset of interleaved image-text documents.
It comprises 24 million web pages, 85 million associated image URLs, and 40 billion text tokens, all meticulously extracted and filtered from CommonCrawl.
Our evaluations on text-only benchmarks show that, despite utilizing only 40 billion tokens, our dataset significantly enhances the performance of our 1.3B model.
Our models set a new state-of-the-art among open-source models on multi-modal math benchmarks such as MathVerse and We-Math.
arXiv Detail & Related papers (2024-09-19T08:41:21Z)
- SBOM Generation Tools in the Python Ecosystem: an In-Detail Analysis [2.828503885204035]
We analyze four popular SBOM generation tools using the CycloneDX standard.
We highlight issues related to dependency versions, metadata files, remote dependencies, and optional dependencies.
We identify a systematic issue with the lack of standards for metadata in the PyPI ecosystem.
arXiv Detail & Related papers (2024-09-02T12:48:10Z)
- Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z)
- Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks [66.87070857705994]
In low-resource settings, the amount of seed data samples to use for data augmentation is very small.
We propose a novel method that augments training data by incorporating a wealth of examples from other datasets.
This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone.
arXiv Detail & Related papers (2024-02-21T02:45:46Z)
- ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
- JEMMA: An Extensible Java Dataset for ML4Code Applications [34.76698017961728]
We introduce JEMMA, a large-scale, diverse, and high-quality dataset targeted at Machine Learning for Source Code (ML4Code).
Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks.
JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties.
arXiv Detail & Related papers (2022-12-18T17:04:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.