Empirical Evaluation of AI-Assisted Software Package Selection: A Knowledge Graph Approach
- URL: http://arxiv.org/abs/2508.05693v1
- Date: Wed, 06 Aug 2025 13:55:43 GMT
- Title: Empirical Evaluation of AI-Assisted Software Package Selection: A Knowledge Graph Approach
- Authors: Siamak Farshidi, Amir Saberhabibi, Behbod Eskafi, Niloofar Nikfarjam, Sadegh Eskandari, Slinger Jansen, Michel Chaudron, Bedir Tekinerdogan
- Abstract summary: This study formulates software package selection as a Multi-Criteria Decision-Making (MCDM) problem. Data pipelines continuously collect and integrate software metadata, usage trends, vulnerability information, and developer sentiment. The system uses large language models to interpret user intent and query the decision model to identify contextually appropriate packages.
- Score: 4.100870096741918
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Selecting third-party software packages in open-source ecosystems like Python is challenging due to the large number of alternatives and limited transparent evidence for comparison. Generative AI tools are increasingly used in development workflows, but their suggestions often overlook dependency evaluation, emphasize popularity over suitability, and lack reproducibility. This creates risks for projects that require transparency, long-term reliability, maintainability, and informed architectural decisions. This study formulates software package selection as a Multi-Criteria Decision-Making (MCDM) problem and proposes a data-driven framework for technology evaluation. Automated data pipelines continuously collect and integrate software metadata, usage trends, vulnerability information, and developer sentiment from GitHub, PyPI, and Stack Overflow. These data are structured into a decision model representing relationships among packages, domain features, and quality attributes. The framework is implemented in PySelect, a decision support system that uses large language models to interpret user intent and query the model to identify contextually appropriate packages. The approach is evaluated using 798,669 Python scripts from 16,887 GitHub repositories and a user study based on the Technology Acceptance Model. Results show high data extraction precision, improved recommendation quality over generative AI baselines, and positive user evaluations of usefulness and ease of use. This work introduces a scalable, interpretable, and reproducible framework that supports evidence-based software selection using MCDM principles, empirical data, and AI-assisted intent modeling.
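To make the MCDM framing concrete, the sketch below scores candidate packages with a simple weighted sum in Python. The packages, criteria, scores, and weights are hypothetical illustrations, not PySelect's actual decision model, which aggregates richer evidence from GitHub, PyPI, and Stack Overflow.

```python
# Minimal weighted-sum MCDM sketch for package selection.
# Hypothetical criteria and scores; NOT PySelect's actual model.

# Normalized scores in [0, 1] per criterion (higher is better).
# "security" is inverted vulnerability exposure: 1.0 means no known CVEs.
candidates = {
    "requests": {"maintenance": 0.9, "popularity": 1.0, "security": 0.8, "sentiment": 0.9},
    "httpx":    {"maintenance": 0.8, "popularity": 0.7, "security": 0.9, "sentiment": 0.8},
    "urllib3":  {"maintenance": 0.9, "popularity": 0.9, "security": 0.7, "sentiment": 0.7},
}

# Weights encode project priorities and must sum to 1.
weights = {"maintenance": 0.3, "popularity": 0.2, "security": 0.35, "sentiment": 0.15}

def weighted_score(scores: dict[str, float]) -> float:
    """Weighted-sum aggregation, the simplest MCDM ranking rule."""
    return sum(weights[c] * scores[c] for c in weights)

ranked = sorted(candidates, key=lambda p: weighted_score(candidates[p]), reverse=True)
for pkg in ranked:
    print(f"{pkg}: {weighted_score(candidates[pkg]):.3f}")
```

More sophisticated MCDM methods (e.g., AHP or TOPSIS) replace the aggregation rule, but the inputs-and-weights structure stays the same.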
Related papers
- Why Authors and Maintainers Link (or Don't Link) Their PyPI Libraries to Code Repositories and Donation Platforms [83.16077040470975]
Metadata of libraries on the Python Package Index (PyPI) plays a critical role in supporting the transparency, trust, and sustainability of open-source libraries. This paper presents a large-scale empirical study combining two targeted surveys sent to 50,000 PyPI authors and maintainers. We analyze more than 1,400 responses using large language model (LLM)-based topic modeling to uncover key motivations and barriers related to linking repositories and donation platforms.
arXiv Detail & Related papers (2026-01-21T16:13:57Z)
- GAICo: A Deployed and Extensible Framework for Evaluating Diverse and Multimodal Generative AI Outputs [8.34331981959369]
We present GAICo (Generative AI Comparator): a deployed, open-source Python library that standardizes GenAI output comparison. GAICo provides a unified framework supporting a comprehensive suite of reference-based metrics for unstructured text, structured data formats, and multimedia. Since its release on PyPI in June 2025, the tool has been downloaded over 13K times across versions as of August 2025, demonstrating growing community interest.
arXiv Detail & Related papers (2025-08-22T19:13:21Z) - Identity resolution of software metadata using Large Language Models [0.0]
- Identity resolution of software metadata using Large Language Models [0.0]
This article presents an evaluation of instruction-tuned large language models for the task of software metadata identity resolution. We benchmarked multiple models against a human-annotated gold standard, examined their behavior on ambiguous cases, and introduced an agreement-based proxy for high-confidence automated decisions.
arXiv Detail & Related papers (2025-05-29T14:47:31Z)
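The agreement-based proxy mentioned above can be illustrated in a few lines: automate only those identity decisions on which all models concur and defer the rest to humans. A hypothetical sketch, not the paper's exact decision rule:

```python
# Agreement-based proxy: auto-accept an identity-resolution verdict only
# when all models agree; otherwise route to human review.
# Hypothetical unanimity rule, not the paper's exact threshold.

def resolve(verdicts: list[bool]) -> str:
    """verdicts: per-model booleans for 'these two records are the same software'."""
    if all(verdicts):
        return "same (auto)"
    if not any(verdicts):
        return "different (auto)"
    return "ambiguous -> human review"

print(resolve([True, True, True]))   # same (auto)
print(resolve([True, False, True]))  # ambiguous -> human review
```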
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
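External TTC of this kind is often realized as best-of-N sampling with a verifier: spend extra inference on candidate solutions and keep the highest-scoring one. A generic sketch with stand-in stubs (generate_patch and verifier_score are placeholders, not the paper's components):

```python
# External test-time-compute sketch: best-of-N candidate generation
# plus verifier-based selection. generate_patch/verifier_score are
# stand-in stubs, not components from the paper.
import random

def generate_patch(issue: str, temperature: float) -> str:
    return f"patch[{issue}@T={temperature:.2f}]"  # stub for an LLM call

def verifier_score(issue: str, patch: str) -> float:
    return random.random()  # stub for a test suite or learned reward model

def best_of_n(issue: str, n: int = 8) -> str:
    # More samples = more inference-time compute = better expected best score.
    candidates = [generate_patch(issue, temperature=0.8) for _ in range(n)]
    return max(candidates, key=lambda p: verifier_score(issue, p))

print(best_of_n("fix-null-deref"))
```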
- VirtualXAI: A User-Centric Framework for Explainability Assessment Leveraging GPT-Generated Personas [0.07499722271664146]
The demand for eXplainable AI (XAI) has increased to enhance the interpretability, transparency, and trustworthiness of AI models. We propose a framework that integrates quantitative benchmarking with qualitative user assessments through virtual personas. This yields an estimated XAI score and provides tailored recommendations for both the optimal AI model and the XAI method for a given scenario.
arXiv Detail & Related papers (2025-03-06T09:44:18Z)
- xai_evals: A Framework for Evaluating Post-Hoc Local Explanation Methods [1.747623282473278]
xai_evals provides a framework for generating, benchmarking, and evaluating explanation methods. It integrates popular techniques like SHAP, LIME, Grad-CAM, Integrated Gradients (IG), and Backtrace. xai_evals enhances the interpretability of machine learning models, fostering transparency and trust in AI systems.
arXiv Detail & Related papers (2025-02-05T09:17:48Z)
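For context on the techniques listed above, here is a direct use of the shap library on a small scikit-learn model. This calls shap itself; xai_evals' own wrapper API is not shown here.

```python
# Computing SHAP values directly with the shap library on a toy model.
# Illustrates one technique xai_evals wraps; does NOT use xai_evals' API.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.Explainer(model)     # auto-selects a tree explainer here
explanation = explainer(X.iloc[:20])  # per-feature attributions
print(explanation.values.shape)       # roughly (20, n_features, n_classes);
                                      # exact shape varies by shap version
```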
- LatteReview: A Multi-Agent Framework for Systematic Review Automation Using Large Language Models [0.0]
LatteReview is a Python-based framework that leverages large language models (LLMs) and multi-agent systems to automate key elements of the systematic review process. The framework supports features such as Retrieval-Augmented Generation (RAG) for incorporating external context, multimodal reviews, Pydantic-based validation for structured inputs and outputs, and asynchronous programming for handling large-scale datasets.
arXiv Detail & Related papers (2025-01-05T17:53:00Z)
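Pydantic-based validation of structured LLM outputs, as mentioned in the entry above, looks roughly like the following. The schema is a hypothetical screening record, not LatteReview's actual models.

```python
# Validating a structured review decision with Pydantic (v2).
# Hypothetical schema for illustration; NOT LatteReview's actual models.
from pydantic import BaseModel, Field, ValidationError

class ScreeningDecision(BaseModel):
    paper_id: str
    include: bool
    confidence: float = Field(ge=0.0, le=1.0)  # rejects out-of-range values
    rationale: str

raw = {"paper_id": "arXiv:2508.05693", "include": True,
       "confidence": 0.92, "rationale": "Matches inclusion criteria."}

try:
    decision = ScreeningDecision(**raw)  # raises if the LLM output is malformed
    print(decision.model_dump())
except ValidationError as err:
    print(err)                           # route back to the agent for repair
```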
- A Statistical Framework for Ranking LLM-Based Chatbots [57.59268154690763]
We propose a statistical framework that incorporates key advancements to address specific challenges in pairwise comparison analysis. First, we introduce a factored tie model that enhances the ability to handle groupings of human-judged comparisons. Second, we extend the framework to model covariance between competitors, enabling deeper insights into performance relationships and intuitive groupings into performance tiers. Third, we resolve optimization challenges arising from parameter non-uniqueness by introducing novel constraints.
arXiv Detail & Related papers (2024-12-24T12:54:19Z)
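Pairwise-comparison ranking of this sort classically builds on Bradley-Terry strengths. The sketch below fits them by gradient ascent on toy data and omits the paper's factored-tie and covariance extensions.

```python
# Plain Bradley-Terry fit on pairwise wins via gradient ascent.
# Omits the paper's factored-tie and covariance extensions; data is toy.
import math

models = ["A", "B", "C"]
wins = {("A", "B"): 7, ("B", "A"): 3,  # A beat B 7 times, B beat A 3 times
        ("A", "C"): 6, ("C", "A"): 4,
        ("B", "C"): 5, ("C", "B"): 5}

theta = {m: 0.0 for m in models}  # log-strengths
for _ in range(2000):
    grad = {m: 0.0 for m in models}
    for (i, j), n_ij in wins.items():
        # P(i beats j) under Bradley-Terry with log-strengths theta.
        p_ij = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
        grad[i] += n_ij * (1.0 - p_ij)
        grad[j] -= n_ij * (1.0 - p_ij)
    for m in models:
        theta[m] += 0.01 * grad[m]
    theta["A"] = 0.0  # fix one parameter: strengths are identifiable only up to a shift

print(sorted(theta.items(), key=lambda kv: -kv[1]))
```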
- Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models [64.28420991770382]
Data-Juicer 2.0 is a data processing system backed by data processing operators spanning text, image, video, and audio modalities. It supports more critical tasks, including data analysis, annotation, and foundation model post-training. It has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI.
arXiv Detail & Related papers (2024-12-23T08:29:57Z)
- Generative AI for Software Metadata: Overview of the Information Retrieval in Software Engineering Track at FIRE 2023 [18.616716369775883]
The Information Retrieval in Software Engineering (IRSE) track aims to develop solutions for automated evaluation of code comments.
The dataset consists of 9,048 pairs of code comments and surrounding code snippets extracted from open-source C-based projects.
The labels generated from large language models increase the bias in the prediction model but lead to less over-fitted results.
arXiv Detail & Related papers (2023-10-27T14:13:23Z)
- Latte: Cross-framework Python Package for Evaluation of Latent-Based Generative Models [65.51757376525798]
Latte is a Python library for evaluation of latent-based generative models.
Latte is compatible with both PyTorch and TensorFlow/Keras, and provides both functional and modular APIs.
arXiv Detail & Related papers (2021-12-20T16:00:28Z)
- DIETERpy: a Python framework for The Dispatch and Investment Evaluation Tool with Endogenous Renewables [62.997667081978825]
DIETER is an open-source power sector model designed to analyze future settings with very high shares of variable renewable energy sources.
It minimizes overall system costs, including fixed and variable costs of various generation, flexibility and sector coupling options.
We introduce DIETERpy, which builds on the existing model version, written in the General Algebraic Modeling System (GAMS), and enhances it with a Python framework.
arXiv Detail & Related papers (2020-10-02T09:27:33Z)
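Cost minimization of the kind DIETER performs reduces, in toy form, to a linear program. The sketch below dispatches two hypothetical technologies against fixed demand under a renewable-share constraint using scipy; it is unrelated to DIETER's actual GAMS formulation, and all numbers are made up.

```python
# Toy dispatch cost minimization with a renewable-share constraint.
# Unrelated to DIETER's actual GAMS formulation; numbers are hypothetical.
from scipy.optimize import linprog

# Decision variables: [gas_MWh, wind_MWh]
cost = [60.0, 15.0]          # variable cost per MWh served

demand = 100.0               # MWh that must be served
renewable_share = 0.8        # require at least 80% wind

# Equality constraint: gas + wind == demand
A_eq, b_eq = [[1.0, 1.0]], [demand]
# Inequality: wind >= share * demand, written as -wind <= -share * demand
A_ub, b_ub = [[0.0, -1.0]], [-renewable_share * demand]

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None), (0, None)])
print(res.x, res.fun)        # optimal dispatch per technology, total cost
```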
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or summaries (including all information) and is not responsible for any consequences arising from their use.