Predicting Vulnerability In Large Codebases With Deep Code Representation
- URL: http://arxiv.org/abs/2004.12783v1
- Date: Fri, 24 Apr 2020 13:18:35 GMT
- Title: Predicting Vulnerability In Large Codebases With Deep Code Representation
- Authors: Anshul Tanwar, Krishna Sundaresan, Parmesh Ashwath, Prasanna Ganesan, Sathish Kumar Chandrasekaran, Sriram Ravi
- Abstract summary: As software engineers write code for various modules, various types of errors often get introduced.
The same or similar issues/bugs that were fixed in the past (although in different modules) tend to be reintroduced into production code.
We developed a novel AI-based system that uses a deep representation of the Abstract Syntax Tree (AST) created from the source code, together with an active feedback loop.
- Score: 6.357681017646283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As software engineers write code for various modules, various
types of errors - coding, logic, semantic, and others (most of which are not
caught by compilation and other tools) - get introduced. Some of these bugs
are found only in later stages of testing, and many are reported by customers
against production code. Companies spend substantial resources, both money
and time, finding and fixing bugs that could have been avoided had the code
been written correctly in the first place. Concealed flaws in software can
also lead to security vulnerabilities that potentially allow attackers to
compromise systems and applications. Interestingly, the same or similar
issues/bugs that were fixed in the past (although in different modules) tend
to be reintroduced into production code.
We developed a novel AI-based system that uses a deep representation of the
Abstract Syntax Tree (AST) created from the source code, together with an
active feedback loop, to identify and flag potential bugs at development time
itself, i.e., as the developer is writing new code (logic and/or functions).
Integrated with the IDE as a plugin, the tool works in the background,
pointing out existing similar functions/code segments and any bugs associated
with them. This enables the developer to incorporate suggestions right at the
time of development, rather than waiting for UT/QA/customers to raise a
defect.
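To make the idea concrete, here is a minimal sketch of the retrieval step: serialize a function's AST into a node-type sequence, embed it, and surface the most similar previously seen function along with the bug that was once fixed in it. Python's built-in `ast` module and a toy bag-of-node-types vector stand in for the paper's C/C++ parser and learned deep representation; the example functions and bug annotations are invented for illustration.

```python
import ast
import math
from collections import Counter

def ast_node_sequence(source: str) -> list[str]:
    """Flatten source code into a pre-order sequence of AST node types."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

def embed(node_seq: list[str]) -> Counter:
    """Toy embedding: a bag of AST node types. The actual system would use
    a learned deep encoder over the (C/C++) AST instead."""
    return Counter(node_seq)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented "history": past functions tagged with bugs fixed in them.
past_functions = {
    "def read(buf, n):\n    return buf[:n]": "off-by-one in length check",
    "def log_msg(msg):\n    print('[log]', msg)": "format-string injection",
}
index = {src: embed(ast_node_sequence(src)) for src in past_functions}

# New code being typed in the IDE: surface the most similar past function
# together with the bug that was previously fixed in it.
new_code = "def copy(data, size):\n    return data[:size]"
query = embed(ast_node_sequence(new_code))
best = max(index, key=lambda src: cosine(query, index[src]))
print("similar past function:", best.splitlines()[0])
print("previously fixed bug:", past_functions[best])
```

In the actual system, the toy embedding would be replaced by the learned deep AST encoder, and the index would cover the codebase's full fix history, refined over time by the active feedback loop.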
We assessed our tool on both open-source code and the Cisco codebase for the
C and C++ programming languages. Our results confirm that a deep
representation of source code combined with an active feedback loop is a
promising approach for predicting security and other vulnerabilities present
in code.
Related papers
- Understanding Code Understandability Improvements in Code Reviews [79.16476505761582]
We analyzed 2,401 code review comments from Java open-source projects on GitHub.
83.9% of suggestions for improvement were accepted and integrated, with fewer than 1% later reverted.
arXiv Detail & Related papers (2024-10-29T12:21:23Z)
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- MoCo: Fuzzing Deep Learning Libraries via Assembling Code [13.937180393991616]
Deep learning (DL) techniques have been applied in software systems across various application scenarios.
DL libraries serve as the underlying foundation for DL systems, and bugs in them can have unpredictable impacts.
We propose MoCo, a novel fuzz testing method for DL libraries that works by assembling code.
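A toy sketch of code-assembly fuzzing in this spirit: splice mutated operation blocks into a seed template, execute the assembled program, and record the inputs that crash. The template, operation pool, and arguments below are invented stand-ins, not MoCo's actual operators or targets.

```python
import random

SEED_TEMPLATE = "import math\nresult = {expr}\n"
OP_BLOCKS = ["math.sqrt({})", "math.log({})", "math.pow({}, 2)"]
ARG_POOL = [-1, 0, 2, 1e308]

def assemble(rng: random.Random) -> str:
    """Splice a mutated operation block into the seed template."""
    expr = rng.choice(OP_BLOCKS).format(rng.choice(ARG_POOL))
    return SEED_TEMPLATE.format(expr=expr)

def fuzz(rounds: int = 20, seed: int = 0) -> list[str]:
    rng, crashes = random.Random(seed), []
    for _ in range(rounds):
        program = assemble(rng)
        try:
            exec(program, {})            # run the assembled test case
        except Exception:
            crashes.append(program)      # crash: a bug candidate to triage
    return crashes

for program in fuzz():
    print("crashing input:", program.replace("\n", " | "))
```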
arXiv Detail & Related papers (2024-05-13T13:40:55Z)
- LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward [3.729516018513228]
We introduce SecRepair, a multipurpose code vulnerability analysis system powered by the large language model CodeGen2.
Inspired by how humans fix code issues, we propose an instruction-based dataset suitable for vulnerability analysis with LLMs.
We identify zero-day and N-day vulnerabilities in 6 Open Source IoT Operating Systems on GitHub.
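For a sense of what an instruction-style vulnerability-repair sample might look like, here is a hypothetical example; the field names and contents are invented for illustration and are not SecRepair's actual schema or data.

```python
# Hypothetical instruction-style sample for LLM-driven vulnerability
# repair; fields and content are invented, not SecRepair's actual schema.
sample = {
    "instruction": "Identify the vulnerability in this C snippet and fix it.",
    "input": "char buf[8];\nstrcpy(buf, user_input);",
    "output": (
        "Buffer overflow (CWE-120): strcpy writes without bounds checking.\n"
        "Fix:\nchar buf[8];\n"
        "strncpy(buf, user_input, sizeof(buf) - 1);\n"
        "buf[sizeof(buf) - 1] = '\\0';"
    ),
}
print(sample["output"])
```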
arXiv Detail & Related papers (2024-01-07T02:46:39Z)
- Assessing the Security of GitHub Copilot Generated Code -- A Targeted Replication Study [11.644996472213611]
Recent studies have investigated security issues in AI-powered code generation tools such as GitHub Copilot and Amazon CodeWhisperer.
This paper replicates the study of Pearce et al., which investigated security weaknesses in Copilot and uncovered several in the code it suggests.
Our results indicate that, with the improvements in newer versions of Copilot, the percentage of vulnerable code suggestions has dropped from 36.54% to 27.25%.
arXiv Detail & Related papers (2023-11-18T22:12:59Z)
- Large Language Models of Code Fail at Completing Code with Potential Bugs [30.80172644795715]
We study the buggy-code completion problem inspired by real-time code suggestion.
We find that the presence of potential bugs significantly degrades the generation performance of the high-performing Code-LLMs.
arXiv Detail & Related papers (2023-06-06T06:35:27Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
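The sketch below shows a generic InfoNCE-style contrastive loss implementing that intuition: the embedding of a function is pulled toward its benign clone and pushed away from deviant (buggy) variants. It is a minimal PyTorch illustration of the general technique, not CONCORD's actual objective or encoder.

```python
import torch
import torch.nn.functional as F

def clone_contrastive_loss(anchor, clone, deviants, temperature=0.07):
    """anchor: (d,), clone: (d,), deviants: (k, d) embedding tensors.
    InfoNCE: the benign clone is the positive (index 0); the deviant
    variants are the negatives."""
    anchor = F.normalize(anchor, dim=-1)
    candidates = F.normalize(
        torch.cat([clone.unsqueeze(0), deviants], dim=0), dim=-1)
    logits = candidates @ anchor / temperature    # (k + 1,) similarities
    target = torch.zeros(1, dtype=torch.long)     # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

# Toy usage: random vectors stand in for a code encoder's outputs.
d = 8
loss = clone_contrastive_loss(torch.randn(d), torch.randn(d),
                              torch.randn(4, d))
print(float(loss))
```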
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- Generation Probabilities Are Not Enough: Uncertainty Highlighting in AI Code Completions [54.55334589363247]
We study whether conveying information about uncertainty enables programmers to more quickly and accurately produce code.
We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits.
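A minimal sketch of the interface idea: given per-token edit-likelihood scores for a completion, mark the tokens most likely to need changes so an editor can highlight them. The tokens, scores, and threshold below are invented for illustration; the paper's point is that a separate edit-likelihood predictor beats raw generation probabilities for this purpose.

```python
def highlight_uncertain(tokens, edit_likelihood, threshold=0.5):
    """Wrap tokens whose predicted edit likelihood exceeds the threshold,
    so an editor plugin could render them highlighted."""
    return " ".join(
        f"[[{tok}]]" if p > threshold else tok
        for tok, p in zip(tokens, edit_likelihood)
    )

# Invented completion and per-token edit-likelihood scores.
tokens = ["return", "buf", "[", "idx", "+", "1", "]"]
edit_p = [0.05, 0.10, 0.02, 0.30, 0.70, 0.85, 0.02]
print(highlight_uncertain(tokens, edit_p))
# -> return buf [ idx [[+]] [[1]] ]
```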
arXiv Detail & Related papers (2023-02-14T18:43:34Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
This research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The study's original contribution is an examination of half of the most significant code advances of the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework that leverages both lexical copying and reference to semantically similar code via retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
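The sketch below illustrates the generic retrieve-then-complete pattern: fetch the most similar snippet from a corpus and prepend it to the prompt so the generator can copy lexically or borrow semantics. The token-overlap retriever and prompt format are simple stand-ins, not ReACC's actual hybrid retriever or model.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a learned retriever."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def retrieve(query: str, corpus: list[str]) -> str:
    return max(corpus, key=lambda snippet: jaccard(query, snippet))

def build_prompt(unfinished_code: str, corpus: list[str]) -> str:
    """Prepend the retrieved exemplar so the generator can copy from it
    or borrow its semantics when completing the code."""
    exemplar = retrieve(unfinished_code, corpus)
    return f"# similar code:\n{exemplar}\n# complete:\n{unfinished_code}"

corpus = [
    "def mean(xs): return sum(xs) / len(xs)",
    "def read_json(path): return json.load(open(path))",
]
print(build_prompt("def average(values): return sum(values) /", corpus))
```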
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.