Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler
- URL: http://arxiv.org/abs/2004.01618v1
- Date: Fri, 3 Apr 2020 15:20:06 GMT
- Title: Using Large-Scale Anomaly Detection on Code to Improve Kotlin Compiler
- Authors: Timofey Bryksin, Victor Petukhov, Ilya Alexin, Stanislav Prikhodko,
Alexey Shpilman, Vladimir Kovalenko, Nikita Povarov
- Abstract summary: We apply anomaly detection to source code and bytecode to facilitate the development of a programming language.
We define an anomaly as a code fragment that is different from typical code written in a particular programming language.
- Score: 4.179323589439977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we apply anomaly detection to source code and bytecode to
facilitate the development of a programming language and its compiler. We
define an anomaly as a code fragment that is different from typical code written
in a particular programming language. Identifying such code fragments is
beneficial to both language developers and end users, since anomalies may
indicate potential issues with the compiler or with runtime performance.
Moreover, anomalies could correspond to problems in language design. For this
study, we choose Kotlin as the target programming language. We outline and
discuss approaches to obtaining vector representations of source code and
bytecode and to the detection of anomalies across vectorized code snippets. The
paper presents a method that aims to detect two types of anomalies: syntax tree
anomalies and so-called compiler-induced anomalies that arise only in the
compiled bytecode. We describe several experiments that employ different
combinations of vectorization and anomaly detection techniques and discuss
types of detected anomalies and their usefulness for language developers. We
demonstrate that the extracted anomalies and the underlying extraction
technique provide additional value for language development.
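The paper outlines several combinations of code vectorization and anomaly detection techniques rather than a single fixed pipeline. As a rough illustration only, the sketch below vectorizes code snippets with hashed character n-grams and flags outliers with an Isolation Forest; both the feature scheme and the detector are assumptions chosen for brevity, not the authors' exact method.

```python
# Illustrative sketch: vectorize code snippets and flag atypical ones.
# Hashed character n-grams and IsolationForest are stand-ins here; the paper
# explores several vectorization / anomaly detection combinations.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.ensemble import IsolationForest

snippets = [
    "fun add(a: Int, b: Int): Int = a + b",
    "fun main() { println(\"hello\") }",
    "val squares = (1..10).map { it * it }.sum()",
    # ... in practice, many thousands of snippets mined from real repositories
]

# Character n-grams avoid committing to a particular tokenizer or AST encoding.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 5),
                               n_features=2**12)
X = vectorizer.transform(snippets)

# Isolation Forest scores each point by how easily it is isolated from the rest;
# lower scores correspond to more anomalous (less typical) snippets.
detector = IsolationForest(n_estimators=100, contamination="auto",
                           random_state=0).fit(X)
scores = detector.decision_function(X)

# Surface the snippets that look least like the rest of the corpus for review.
for score, snippet in sorted(zip(scores, snippets))[:2]:
    print(f"{score:+.3f}  {snippet}")
```

In the paper's setting, the same idea would be applied separately to syntax-tree representations and to compiled bytecode, so that compiler-induced anomalies that surface only in bytecode can be distinguished from anomalies already visible in the source.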
Related papers
- Development and Benchmarking of Multilingual Code Clone Detector [2.253851493296371]
Multilingual code clone detectors make it easier to add support for new languages, since only syntax information for the target language needs to be provided.
We propose a multilingual code block extraction method based on ANTLR parser generation and implement a multilingual code clone detector (MSCCD).
Compared to ten state-of-the-art detectors, MSCCD performs at an average level while supporting a significantly larger number of languages.
arXiv Detail & Related papers (2024-09-10T03:08:33Z)
- Code Linting using Language Models [0.7519872646378836]
Code linters play a crucial role in developing high-quality software systems.
Despite their benefits, code linters are often language-specific, focused on certain types of issues, and prone to false positives.
This paper investigates whether large language models can be used to develop a more versatile code linter.
arXiv Detail & Related papers (2024-06-27T19:59:49Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Evolutionary Generative Fuzzing for Differential Testing of the Kotlin Compiler [14.259471945857431]
We investigate the effectiveness of differential testing in finding bugs within the Kotlin compilers developed at JetBrains.
We propose a black-box generative approach that creates input programs for the K1 and K2 compilers.
Our case study shows that the proposed approach effectively detects bugs in K1 and K2; these bugs have been confirmed and (some) fixed by JetBrains developers.
arXiv Detail & Related papers (2024-01-12T16:01:12Z)
- Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code.
We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned code in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- Language Agnostic Code Embeddings [61.84835551549612]
We focus on the cross-lingual capabilities of code embeddings across different programming languages.
Code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details.
We show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks.
arXiv Detail & Related papers (2023-10-25T17:34:52Z)
- Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System [73.52878118434147]
We present methods to reverse-engineer the decoding method used to generate text.
Our ability to discover which decoding strategy was used has implications for detecting generated text.
arXiv Detail & Related papers (2023-09-09T18:19:47Z)
- MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
- Inducing Language-Agnostic Multilingual Representations [61.97381112847459]
Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world.
We examine three approaches for this: (i) re-aligning the vector spaces of target languages to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. A minimal sketch of approach (ii) appears after this entry.
arXiv Detail & Related papers (2020-08-20T17:58:56Z)
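As a loose illustration of approach (ii) above (removing language-specific means and variances), the sketch below standardizes each language's embeddings separately; the toy data and the standardization routine are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of per-language mean/variance removal:
# standardize each language's embeddings separately so that language-specific
# offsets and scales no longer dominate cross-lingual comparisons.
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: two languages, 100 sentences each, 8 dimensions.
embeddings = {
    "en": rng.normal(loc=0.5, scale=1.0, size=(100, 8)),
    "ru": rng.normal(loc=-0.3, scale=2.0, size=(100, 8)),
}

def remove_language_stats(vectors: np.ndarray) -> np.ndarray:
    """Subtract the per-language mean and divide by the per-language std, per dimension."""
    mean = vectors.mean(axis=0, keepdims=True)
    std = vectors.std(axis=0, keepdims=True) + 1e-8  # guard against zero variance
    return (vectors - mean) / std

normalized = {lang: remove_language_stats(vecs) for lang, vecs in embeddings.items()}

# After standardization, every language has (approximately) zero mean and unit
# variance, so comparisons are no longer dominated by language identity.
for lang, vecs in normalized.items():
    print(lang, round(float(vecs.mean()), 3), round(float(vecs.std()), 3))
```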
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.