Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler
- URL: http://arxiv.org/abs/2004.01618v1
- Date: Fri, 3 Apr 2020 15:20:06 GMT
- Title: Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler
- Authors: Timofey Bryksin, Victor Petukhov, Ilya Alexin, Stanislav Prikhodko,
Alexey Shpilman, Vladimir Kovalenko, Nikita Povarov
- Abstract summary: We apply anomaly detection to source code and bytecode to facilitate the development of a programming language.
We define anomaly as a code fragment that is different from typical code written in a particular programming language.
- Score: 4.179323589439977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we apply anomaly detection to source code and bytecode to
facilitate the development of a programming language and its compiler. We
define anomaly as a code fragment that is different from typical code written
in a particular programming language. Identifying such code fragments is
beneficial to both language developers and end users, since anomalies may
indicate potential issues with the compiler or with runtime performance.
Moreover, anomalies could correspond to problems in language design. For this
study, we choose Kotlin as the target programming language. We outline and
discuss approaches to obtaining vector representations of source code and
bytecode and to the detection of anomalies across vectorized code snippets. The
paper presents a method that aims to detect two types of anomalies: syntax tree
anomalies and so-called compiler-induced anomalies that arise only in the
compiled bytecode. We describe several experiments that employ different
combinations of vectorization and anomaly detection techniques and discuss
types of detected anomalies and their usefulness for language developers. We
demonstrate that the extracted anomalies and the underlying extraction
technique provide additional value for language development.
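The abstract's pipeline of vectorizing code snippets and then flagging outliers can be illustrated with a minimal sketch. This is not the paper's method: the abstract does not name the vectorizations or detectors used, so the sketch swaps in relative token frequencies as the vector representation and Euclidean distance from the corpus centroid as the anomaly score. The names `tokenize`, `vectorize`, `anomaly_scores`, and `SNIPPETS` are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter

# Toy corpus (assumption): four "typical" Kotlin one-liners and one unusual snippet.
SNIPPETS = [
    "fun add(a: Int, b: Int): Int = a + b",
    "fun sub(a: Int, b: Int): Int = a - b",
    "fun mul(a: Int, b: Int): Int = a * b",
    "fun div(a: Int, b: Int): Int = a / b",
    "val x = ((((((((((1))))))))))",
]


def tokenize(snippet):
    """Split a snippet into identifiers and single non-space symbols."""
    return re.findall(r"[A-Za-z_]\w*|\S", snippet)


def vectorize(snippet, vocabulary):
    """Map a snippet to relative token frequencies over a fixed vocabulary."""
    counts = Counter(tokenize(snippet))
    total = sum(counts.values()) or 1
    return [counts[token] / total for token in vocabulary]


def anomaly_scores(snippets):
    """Score each snippet by its Euclidean distance from the corpus centroid."""
    vocabulary = sorted({token for s in snippets for token in tokenize(s)})
    vectors = [vectorize(s, vocabulary) for s in snippets]
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    return [math.dist(vector, centroid) for vector in vectors]


if __name__ == "__main__":
    scores = anomaly_scores(SNIPPETS)
    # Print snippets from most to least anomalous under this toy score.
    for score, snippet in sorted(zip(scores, SNIPPETS), reverse=True):
        print(f"{score:.3f}  {snippet}")
```

On this toy corpus the deeply parenthesized expression receives the highest score, because its token distribution differs most from the rest. A realistic setup would presumably use richer representations, such as features derived from syntax trees or bytecode, and a dedicated outlier detector instead of a centroid distance.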
Related papers
- Code Linting using Language Models [0.7519872646378836]
Code linters play a crucial role in developing high-quality software systems.
Despite their benefits, code linters are often language-specific, focused on certain types of issues, and prone to false positives.
This paper investigates whether large language models can be used to develop a more versatile code linter.
arXiv Detail & Related papers (2024-06-27T19:59:49Z)
- CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation [60.799992690487336]
We propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks.
CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Evolutionary Generative Fuzzing for Differential Testing of the Kotlin Compiler [14.259471945857431]
We investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains.
We propose a black-box generative approach that creates input programs for the K1 and K2 compilers.
Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed and (some) fixed by JetBrains developers.
arXiv Detail & Related papers (2024-01-12T16:01:12Z)
- Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code.
We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- Language Agnostic Code Embeddings [61.84835551549612]
We focus on the cross-lingual capabilities of code embeddings across different programming languages.
Code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details.
We show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks.
arXiv Detail & Related papers (2023-10-25T17:34:52Z)
- Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System [73.52878118434147]
We present methods to reverse-engineer the decoding method used to generate text.
Our ability to discover which decoding strategy was used has implications for detecting generated text.
arXiv Detail & Related papers (2023-09-09T18:19:47Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- Cross-Language Binary-Source Code Matching with Intermediate Representations [27.843666274502198]
This paper formulates the problem of cross-language binary-source code matching, and develops a new dataset for this new problem.
We present a novel approach XLIR, which is a Transformer-based neural network by learning the intermediate representations for both binary and source code.
Our proposed XLIR with intermediate representations significantly outperforms other state-of-the-art models in both of the two tasks.
arXiv Detail & Related papers (2022-01-19T05:17:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.