Related papers: SBOM Challenges for Developers: From Analysis of Stack Overflow Questions

Related papers

LEXam: Benchmarking Legal Reasoning on 340 Law Exams [61.344330783528015]
LEXam is a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.<n>The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions.
arXiv Detail & Related papers (2025-05-19T08:48:12Z)
Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources. We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z)
Augmenting Software Bills of Materials with Software Vulnerability Description: A Preliminary Study on GitHub [8.727176816793179]
This paper reports the results of a preliminary study in which we augmented SBOMs of 40 open-source projects with information about Common Vulnerabilities and Exposures. Our augmented SBOMs have been evaluated by submitting pull requests and by asking project owners to answer a survey. Although, in most cases, augmented SBOMs were not directly accepted because owners required a continuous SBOM update, the received feedback shows the usefulness of the suggested SBOM augmentation.
arXiv Detail & Related papers (2025-03-18T08:04:22Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs. LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Developer Challenges on Large Language Models: A Study of Stack Overflow and OpenAI Developer Forum Posts [2.704899832646869]
Large Language Models (LLMs) have gained widespread popularity due to their exceptional capabilities across various domains. This study investigates developers' challenges by analyzing community interactions on Stack Overflow and OpenAI Developer Forum.
arXiv Detail & Related papers (2024-11-16T19:38:27Z)
Benchmarking Multimodal Retrieval Augmented Generation with Dynamic VQA Dataset and Self-adaptive Planning Agent [102.31558123570437]
Multimodal Retrieval Augmented Generation (mRAG) plays an important role in mitigating the "hallucination" issue inherent in multimodal large language models (MLLMs)<n>We propose the first self-adaptive planning agent for multimodal retrieval, OmniSearch.
arXiv Detail & Related papers (2024-11-05T09:27:21Z)
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories [55.161075901665946]
Super aims to capture the realistic challenges faced by researchers working with Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets: 45 end-to-end problems with annotated expert solutions, 152 sub problems derived from the expert set that focus on specific challenges, and 602 automatically generated problems for larger-scale development. We show that state-of-the-art approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the scenarios.
arXiv Detail & Related papers (2024-09-11T17:37:48Z)
LiveBench: A Challenging, Contamination-Free LLM Benchmark [101.21578097087699]
We release LiveBench, the first benchmark that contains frequently-updated questions from recent information sources. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time.
arXiv Detail & Related papers (2024-06-27T16:47:42Z)
ChatGPT vs LLaMA: Impact, Reliability, and Challenges in Stack Overflow Discussions [13.7001994656622]
ChatGPT has shaken up Stack Overflow, the premier platform for developers' queries on programming and software development. Two months after ChatGPT's release, Meta released its answer with its own Large Language Model (LLM) called LLaMA: the race was on.
arXiv Detail & Related papers (2024-02-13T21:15:33Z)
BOMs Away! Inside the Minds of Stakeholders: A Comprehensive Study of Bills of Materials for Software Systems [11.719062411327952]
Software Bills of Materials (SBOMs) have emerged as tools to facilitate the management of software dependencies, vulnerabilities, licenses, and the supply chain. Recent studies have shown that SBOMs are still an early technology not yet adequately adopted in practice. We identify 12 major challenges facing the creation and use of SBOMs, including those related to the SBOM content, deficiencies in SBOM tools, SBOM maintenance and verification, and domain-specific challenges.
arXiv Detail & Related papers (2023-09-21T16:11:00Z)
On the Way to SBOMs: Investigating Design Issues and Solutions in Practice [25.12690604349815]
The Software Bill of Materials (SBOM) has emerged as a promising solution, providing a machine-readable inventory of software components used. This paper presents an analysis of 4,786 GitHub discussions from 510 SBOM-related projects.
arXiv Detail & Related papers (2023-04-26T03:30:31Z)
Understanding the Usability Challenges of Machine Learning In High-Stakes Decision Making [67.72855777115772]
Machine learning (ML) is being applied to a diverse and ever-growing set of domains. In many cases, domain experts -- who often have no expertise in ML or data science -- are asked to use ML predictions to make high-stakes decisions. We investigate the ML usability challenges present in the domain of child welfare screening through a series of collaborations with child welfare screeners.
arXiv Detail & Related papers (2021-03-02T22:50:45Z)
OpenHoldem: An Open Toolkit for Large-Scale Imperfect-Information Game Research [82.09426894653237]
OpenHoldem is an integrated toolkit for large-scale imperfect-information game research using NLTH. OpenHoldem makes three main contributions to this research direction: 1) a standardized evaluation protocol for thoroughly evaluating different NLTH AIs, 2) three publicly available strong baselines for NLTH AI, and 3) an online testing platform with easy-to-use APIs for public NLTH AI evaluation.
arXiv Detail & Related papers (2020-12-11T07:24:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.