Automatic Analysis of Available Source Code of Top Artificial
Intelligence Conference Papers
- URL: http://arxiv.org/abs/2209.14155v1
- Date: Wed, 28 Sep 2022 15:05:58 GMT
- Title: Automatic Analysis of Available Source Code of Top Artificial
Intelligence Conference Papers
- Authors: Jialiang Lin, Yingmin Wang, Yao Yu, Yu Zhou, Yidong Chen, Xiaodong Shi
- Abstract summary: We propose a method to automatically identify papers with available source code and extract their source code repository URLs.
We find that 20.5% of regular papers of 10 top AI conferences published from 2010 to 2019 are identified as papers with available source code.
A large-scale, comprehensive statistical analysis provides a general picture of the source code of AI conference papers.
- Score: 9.498078340492087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Source code is essential for researchers to reproduce the methods and
replicate the results of artificial intelligence (AI) papers. Some
organizations and researchers manually collect AI papers with available source
code to contribute to the AI community. However, manual collection is a
labor-intensive and time-consuming task. To address this issue, we propose a
method to automatically identify papers with available source code and extract
their source code repository URLs. With this method, we find that 20.5% of
regular papers of 10 top AI conferences published from 2010 to 2019 are
identified as papers with available source code and that 8.1% of these source
code repositories are no longer accessible. We also create the XMU NLP Lab
README Dataset, the largest dataset of labeled README files for source code
document research. Through this dataset, we have discovered that quite a few
README files have no installation instructions or usage tutorials provided.
Further, a large-scale comprehensive statistical analysis is made for a general
picture of the source code of AI conference papers. The proposed solution can
also go beyond AI conference papers to analyze other scientific papers from
both journals and conferences to shed light on more domains.
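The core extraction step the abstract describes can be sketched as a URL scan over a paper's text. This is a minimal illustrative sketch, not the authors' actual implementation: the regex, the list of hosting platforms, and the function name are all assumptions made for the example.

```python
import re

# Hypothetical pattern for repository URLs on common code-hosting
# platforms; the platform list is an illustrative assumption.
REPO_URL_PATTERN = re.compile(
    r"https?://(?:www\.)?"
    r"(?:github\.com|gitlab\.com|bitbucket\.org)"
    r"/[\w.-]+/[\w.-]+",
    re.IGNORECASE,
)

def extract_repo_urls(paper_text: str) -> list[str]:
    """Return the unique repository URLs found in a paper's text,
    preserving their order of first appearance."""
    seen, urls = set(), []
    for match in REPO_URL_PATTERN.findall(paper_text):
        url = match.rstrip(".")  # drop a trailing sentence period
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

A pipeline like the one the paper proposes would then check each extracted URL for accessibility (e.g., an HTTP request) to measure how many repositories are no longer reachable.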
Related papers
- AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents [49.67355440164857]
We introduce AIRS-Bench, a suite of 20 tasks sourced from state-of-the-art machine learning papers. AIRS-Bench tasks assess agentic capabilities over the full research lifecycle. We open-source the AIRS-Bench task definitions and evaluation code to catalyze further development in autonomous scientific research.
arXiv Detail & Related papers (2026-02-06T16:45:02Z)
- PaperSearchQA: Learning to Search and Reason over Scientific Papers with RLVR [64.22412492998754]
We release a search corpus of 16 million biomedical paper abstracts and construct a challenging factoid QA dataset called PaperSearchQA. We train search agents in this environment to outperform non-RL retrieval baselines. Our data creation methods are scalable and easily extendable to other scientific domains.
arXiv Detail & Related papers (2026-01-26T06:46:16Z)
- Executable Knowledge Graphs for Replicating AI Research [65.41207324831583]
Executable Knowledge Graphs (xKG) is a modular and pluggable knowledge base that automatically integrates technical insights, code snippets, and domain-specific knowledge extracted from scientific literature. Code will be released at https://github.com/zjunlp/xKG.
arXiv Detail & Related papers (2025-10-20T17:53:23Z)
- You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models [1.0268444449457959]
In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. We present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We urge the research community and repository operators to take immediate action to close these hidden security gaps.
arXiv Detail & Related papers (2025-10-04T10:03:17Z)
- Scaling Generalist Data-Analytic Agents [95.05161133349242]
DataMind is a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents.
arXiv Detail & Related papers (2025-09-29T17:23:08Z)
- Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning [57.09163579304332]
We introduce PaperCoder, a framework that transforms machine learning papers into functional code repositories.
PaperCoder operates in three stages; in the planning stage, it designs the system architecture with diagrams, identifies file dependencies, and generates configuration files.
We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations.
arXiv Detail & Related papers (2025-04-24T01:57:01Z)
- PaperBench: Evaluating AI's Ability to Replicate AI Research [3.4567792239799133]
PaperBench is a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research.
Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch.
PaperBench contains 8,316 individually gradable tasks.
arXiv Detail & Related papers (2025-04-02T15:55:24Z)
- BigDocs: An Open Dataset for Training Multimodal Models on Document and Code Tasks [57.589795399265945]
We introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks.
We also introduce BigDocs-Bench, a benchmark suite with 10 novel tasks.
Our experiments show that training with BigDocs-Bench improves average performance by up to 25.8% over closed-source GPT-4o.
arXiv Detail & Related papers (2024-12-05T21:41:20Z)
- Automatic answering of scientific questions using the FACTS-V1 framework: New methods in research to increase efficiency through the use of AI [0.0]
This article presents the prototype of the FACTS-V1 (Filtering and Analysis of Content in Textual Sources) framework.
With the help of the application, numerous scientific papers can be automatically extracted, analyzed and interpreted from open access document servers.
The aim of the framework is to provide recommendations for future scientific questions based on existing data.
arXiv Detail & Related papers (2024-12-01T18:55:39Z)
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing [13.717170962455526]
We present the SEART Data Hub, a web application that makes it easy to build and pre-process large-scale datasets of code mined from public GitHub repositories.
Through a simple web interface, researchers can specify a set of mining criteria as well as specific pre-processing steps they want to perform.
After submitting the request, the user will receive an email with a download link for the required dataset within a few hours.
arXiv Detail & Related papers (2024-09-27T11:42:19Z)
- DeepDiveAI: Identifying AI Related Documents in Large Scale Literature Data [4.870043547158868]
The dataset was created using an advanced Long Short-Term Memory (LSTM) model trained on a binary classification task.
The model was trained and validated on a vast dataset, achieving high accuracy, precision, recall, and F1-score.
The resulting DeepDiveAI dataset comprises over 9.4 million AI-related papers published since the Dartmouth Conference, from 1956 to 2024.
arXiv Detail & Related papers (2024-08-23T07:05:12Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
- Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning [1.8270184406083445]
We explore using large language models (LLMs) and prompting strategies to automatically extract dimensions from documents.
Our approach could aid data publishers and practitioners in creating machine-readable documentation.
We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results.
arXiv Detail & Related papers (2024-04-04T10:09:28Z)
- GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration [97.68234051078997]
We discuss how Pyserini can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.
We include a Jupyter Notebook-based walk through the core interoperability features, available on GitHub.
We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections.
arXiv Detail & Related papers (2023-06-02T12:09:59Z)
- Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes [54.13559879916708]
EVAPORATE is a prototype system powered by large language models (LLMs).
Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.
We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
arXiv Detail & Related papers (2023-04-19T06:00:26Z)
- Paperswithtopic: Topic Identification from Paper Title Only [5.025654873456756]
We present a dataset of papers from the field of artificial intelligence (AI), pairing each paper's title with its sub-field.
We also present results on how to predict a paper's AI sub-field from a given paper title only.
For the transformer models, we also present gradient-based, attention visualizations to further explain the model's classification process.
arXiv Detail & Related papers (2021-10-09T06:32:09Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Artificial Intelligence in Drug Discovery: Applications and Techniques [33.59138543942538]
Various AI techniques have been used in a wide range of applications, such as virtual screening and drug design.
We first give an overview on drug discovery and discuss related applications, which can be reduced to two major tasks.
We then discuss common data resources, molecule representations and benchmark platforms.
arXiv Detail & Related papers (2021-06-09T20:46:44Z)
- A Methodology for Creating AI FactSheets [67.65802440158753]
This paper describes a methodology for creating the form of AI documentation we call FactSheets.
Within each step of the methodology, we describe the issues to consider and the questions to explore.
This methodology will accelerate the broader adoption of transparent AI documentation.
arXiv Detail & Related papers (2020-06-24T15:08:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.