Drop the Golden Apples: Identifying Third-Party Reuse by DB-Less Software Composition Analysis
- URL: http://arxiv.org/abs/2503.22576v1
- Date: Fri, 28 Mar 2025 16:25:24 GMT
- Title: Drop the Golden Apples: Identifying Third-Party Reuse by DB-Less Software Composition Analysis
- Authors: Lyuye Zhang, Chengwei Liu, Jiahui Wu, Shiyang Zhang, Chengyue Liu, Zhengzi Xu, Sen Chen, Yang Liu
- Abstract summary: Third-party libraries (TPLs) in modern software development introduce significant security and compliance risks. We propose the first framework of DB-Less Software Composition Analysis (SCA), which dispenses with the traditional heavy feature database. Our experiments on two typical scenarios, native library identification for Android and copy-based TPL reuse for C/C++, demonstrate a promising future for database-less strategies in SCA.
- Score: 11.193453132177222
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The prevalent use of third-party libraries (TPLs) in modern software development introduces significant security and compliance risks, necessitating Software Composition Analysis (SCA) to manage these threats. However, the accuracy of SCA tools depends heavily on the quality of the integrated feature database that is cross-referenced against user projects. With open-source ecosystems growing exponentially and large models being integrated into software development, maintaining a comprehensive feature database for potential TPLs becomes even more challenging. To this end, drawing on how LLM applications have evolved in their interactions with external data, we propose the first framework of DB-Less SCA, which discards the traditional heavy database and embraces the flexibility of LLMs to mimic the manual analysis of security analysts: retrieving identifying evidence and confirming the identity of TPLs with supporting information from the open Internet. Our experiments on two typical scenarios, native library identification for Android and copy-based TPL reuse for C/C++, especially on artifacts that tend to be underappreciated, demonstrate a promising future for database-less strategies in SCA.
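As a rough illustration of the DB-less workflow described in the abstract, the sketch below pulls printable strings out of a native library as identification evidence and asks an LLM to name the TPL, justified by evidence it can verify on the open Internet. This is a minimal sketch, not the paper's implementation: `extract_strings` and the `query_llm` wrapper are hypothetical stand-ins the reader would supply.

```python
# Minimal DB-less SCA sketch (illustrative only): identify a TPL from string
# evidence plus an LLM, with no pre-built feature database.
import re
from pathlib import Path


def extract_strings(binary_path: str, min_len: int = 8, cap: int = 200) -> list[str]:
    """Pull printable ASCII strings out of a native library as identification evidence."""
    data = Path(binary_path).read_bytes()
    hits = re.findall(rb"[ -~]{%d,}" % min_len, data)
    return [h.decode("ascii") for h in hits[:cap]]


def identify_tpl(binary_path: str, query_llm) -> str:
    """Ask an LLM to act as a security analyst and name the library, citing
    evidence verifiable on the open Internet rather than a local database."""
    evidence = "\n".join(extract_strings(binary_path))
    prompt = (
        "You are a security analyst performing software composition analysis.\n"
        "The strings below were extracted from an unknown native library.\n"
        "Identify the most likely open-source library (and version, if possible),\n"
        "and justify the answer with evidence verifiable on the open Internet.\n\n"
        + evidence
    )
    return query_llm(prompt)  # query_llm: any chat-completion wrapper the reader supplies
```

In practice, the paper's two scenarios (Android native libraries and copy-based C/C++ reuse) would feed different kinds of evidence into the same analyst-style loop.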
Related papers
- FamilyTool: A Multi-hop Personalized Tool Use Benchmark [94.1158032740113]
We introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG).
FamilyTool challenges Large Language Models with queries spanning 1 to 3 relational hops.
Experiments reveal significant performance gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-04-09T10:42:36Z) - LLMs in Software Security: A Survey of Vulnerability Detection Techniques and Insights [12.424610893030353]
Large Language Models (LLMs) are emerging as transformative tools for software vulnerability detection. This paper provides a detailed survey of LLMs in vulnerability detection. We address challenges such as cross-language vulnerability detection, multimodal data integration, and repository-level analysis.
arXiv Detail & Related papers (2025-02-10T21:33:38Z) - Towards Human-Guided, Data-Centric LLM Co-Pilots [53.35493881390917]
CliMB-DC is a human-guided, data-centric framework for machine learning co-pilots. It combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. We show how CliMB-DC can transform uncurated datasets into ML-ready formats.
arXiv Detail & Related papers (2025-01-17T17:51:22Z) - Enhancing Security in Third-Party Library Reuse -- Comprehensive Detection of 1-day Vulnerability through Code Patch Analysis [8.897599530972638]
Third-party libraries (TPLs) can introduce vulnerabilities (known as 1-day vulnerabilities) because of the low maintenance of TPLs. VULTURE aims at identifying 1-day vulnerabilities that arise from the reuse of vulnerable TPLs. VULTURE successfully identified 175 vulnerabilities from 178 reused TPLs.
arXiv Detail & Related papers (2024-11-29T12:02:28Z) - Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection [9.652886240532741]
This paper thoroughly analyses large language models' capabilities in detecting vulnerabilities within source code.
We evaluate the performance of six open-source models that are specifically trained for vulnerability detection against six general-purpose LLMs.
arXiv Detail & Related papers (2024-08-29T10:00:57Z) - Large Language Model as a Catalyst: A Paradigm Shift in Base Station Siting Optimization [62.16747639440893]
Large language models (LLMs) and their associated technologies continue to advance, particularly in the realms of prompt engineering and agent engineering. Our proposed framework incorporates retrieval-augmented generation (RAG) to enhance the system's ability to acquire domain-specific knowledge and generate solutions.
arXiv Detail & Related papers (2024-08-07T08:43:32Z) - Exploring the extent of similarities in software failures across industries using LLMs [0.0]
This research utilizes the Failure Analysis Investigation with LLMs (FAIL) model to extract industry-specific information.
In previous work, news articles were collected from reputable sources and categorized by incident in a database.
This research extends these methods by categorizing articles into specific domains and types of software failures.
arXiv Detail & Related papers (2024-08-07T03:48:07Z) - Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenge of re-identification attacks enabled by Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z) - "Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) by retrieving information from external knowledge bases.
In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z) - Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks.
However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs.
We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z) - ChatSOS: LLM-based knowledge Q&A system for safety engineering [0.0]
This study introduces an LLM-based Q&A system for safety engineering, enhancing the comprehension and response accuracy of the model.
We employ prompt engineering to incorporate external knowledge databases, thus enriching the LLM with up-to-date and reliable information.
Our findings indicate that the integration of external knowledge significantly augments the capabilities of LLMs for in-depth problem analysis and autonomous task assignment.
arXiv Detail & Related papers (2023-12-14T03:25:23Z) - Serving Deep Learning Model in Relational Databases [70.53282490832189]
Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains.
We highlight three pivotal paradigms: the state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks.
The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS); a minimal sketch of this idea appears at the end of the list.
arXiv Detail & Related papers (2023-10-07T06:01:35Z)
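To make the UDF-centric paradigm from the last entry concrete, here is a minimal sketch that uses SQLite's Python UDF API as an assumed stand-in for a full RDBMS; the "model" is a hand-written logistic scorer rather than a real deep-learning framework call, so this only illustrates the architectural idea of running inference inside a SQL query.

```python
# UDF-centric serving sketch: wrap a (toy) model as a user-defined function and
# invoke it from SQL, so inference runs inside the database query.
import math
import sqlite3


def score(x1: float, x2: float) -> float:
    """Toy 'model': logistic regression with fixed, hard-coded weights."""
    z = 0.8 * x1 - 0.3 * x2 + 0.1
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, f1 REAL, f2 REAL)")
conn.executemany(
    "INSERT INTO samples (f1, f2) VALUES (?, ?)",
    [(0.5, 1.2), (2.0, 0.3), (-1.0, 0.7)],
)

# Register the scorer as a two-argument SQL function named model_score.
conn.create_function("model_score", 2, score)

for row in conn.execute("SELECT id, model_score(f1, f2) FROM samples"):
    print(row)  # (row id, predicted score)
```

A full system would replace the scorer with a call into a DL runtime, but the encapsulation step shown here is the essence of the UDF-centric design.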