MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code
- URL: http://arxiv.org/abs/2408.05163v1
- Date: Fri, 9 Aug 2024 16:32:38 GMT
- Title: MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code
- Authors: Moritz Mock, Jorge Melegati, Max Kretschmann, Nicolás E. Díaz Ferreyra, Barbara Russo
- Abstract summary: MADE-WIC is a large dataset of functions and their comments with multiple annotations for technical debt and code weaknesses.
It contains about 860K code functions and more than 2.7M related comments from 12 open-source projects.
- Score: 2.399010142304227
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we present MADE-WIC, a large dataset of functions and their comments with multiple annotations for technical debt and code weaknesses leveraging different state-of-the-art approaches. It contains about 860K code functions and more than 2.7M related comments from 12 open-source projects. To the best of our knowledge, no such dataset is publicly available. MADE-WIC aims to provide researchers with a curated dataset on which to test and compare tools designed for the detection of code weaknesses and technical debt. As we have fused existing datasets, researchers can evaluate the performance of their tools while also controlling for bias related to the annotation definitions and dataset construction. The demonstration video can be retrieved at https://www.youtube.com/watch?v=GaQodPrcb6E.
Related papers
- Improving the detection of technical debt in Java source code with an enriched dataset [12.07607688189035]
Technical debt (TD) is the additional work and costs that emerge when developers opt for a quick and easy solution to a problem.
Recent research has focused on detecting Self-Admitted Technical Debts (SATDs) by analyzing comments embedded in source code.
We curated the first-ever dataset of TD identified through code comments, coupled with its associated source code.
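The comment-based SATD detection described above is often bootstrapped with a keyword baseline. The sketch below illustrates that idea only; the pattern list and function name are illustrative assumptions, not the annotation rules used by MADE-WIC or the enriched dataset:

```python
import re

# Keywords commonly used as a simple baseline for spotting
# Self-Admitted Technical Debt (SATD) in code comments.
# This list is illustrative, not any paper's actual pattern set.
SATD_PATTERNS = re.compile(
    r"\b(todo|fixme|hack|workaround|temporary|kludge)\b",
    re.IGNORECASE,
)

def flag_satd_comments(comments):
    """Return the comments that match a SATD keyword pattern."""
    return [c for c in comments if SATD_PATTERNS.search(c)]

comments = [
    "# TODO: handle the overflow case properly",
    "# computes the checksum of the buffer",
    "# HACK: works around a driver bug, remove later",
]
print(flag_satd_comments(comments))
```

Real SATD detectors go well beyond keyword matching (e.g., trained classifiers over comment text), but a baseline like this is a common point of comparison.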
arXiv Detail & Related papers (2024-11-08T10:12:33Z)
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- MatViX: Multimodal Information Extraction from Visually Rich Articles [6.349779979863784]
In materials science, extracting structured information from research articles can accelerate the discovery of new materials.
We introduce MatViX, a benchmark consisting of 324 full-length research articles and 1,688 complex structured files.
These files are extracted from text, tables, and figures in full-length documents, providing a comprehensive challenge for MIE.
arXiv Detail & Related papers (2024-10-27T16:13:58Z)
- SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing [13.717170962455526]
We present the SEART Data Hub, a web application that allows researchers to easily build and pre-process large-scale datasets of code mined from public GitHub repositories.
Through a simple web interface, researchers can specify a set of mining criteria as well as specific pre-processing steps they want to perform.
After submitting the request, the user will receive an email with a download link for the required dataset within a few hours.
arXiv Detail & Related papers (2024-09-27T11:42:19Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark thus illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - On Training a Neural Network to Explain Binaries [43.27448128029069]
In this work, we investigate the possibility of training a deep neural network on the task of binary code understanding.
We build our own dataset derived from a capture of Stack Overflow containing 1.1M entries.
arXiv Detail & Related papers (2024-04-30T15:34:51Z) - Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z) - JEMMA: An Extensible Java Dataset for ML4Code Applications [34.76698017961728]
We introduce JEMMA, a large-scale, diverse, and high-quality dataset targeted at Machine Learning for Source Code (ML4Code).
Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks.
JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties.
arXiv Detail & Related papers (2022-12-18T17:04:14Z) - KACC: A Multi-task Benchmark for Knowledge Abstraction, Concretization
and Completion [99.47414073164656]
A comprehensive knowledge graph (KG) contains an instance-level entity graph and an ontology-level concept graph.
The two-view KG provides a testbed for models to "simulate" human abilities of knowledge abstraction, concretization, and completion.
We propose a unified KG benchmark by improving existing benchmarks in terms of dataset scale, task coverage, and difficulty.
arXiv Detail & Related papers (2020-04-28T16:21:57Z) - Rapidly Bootstrapping a Question Answering Dataset for COVID-19 [88.86456834766288]
We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19.
This is the first publicly available resource of its type, and intended as a stopgap measure for guiding research until more substantial evaluation resources become available.
arXiv Detail & Related papers (2020-04-23T17:35:11Z) - Comprehensive Instructional Video Analysis: The COIN Dataset and
Performance Evaluation [100.68317848808327]
We present a large-scale dataset named "COIN" for COmprehensive INstructional video analysis.
The COIN dataset contains 11,827 videos of 180 tasks in 12 domains related to daily life.
With a newly developed toolbox, all videos are efficiently annotated with a series of step labels and the corresponding temporal boundaries.
arXiv Detail & Related papers (2020-03-20T16:59:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.