Evaluating the Impact of Source Code Parsers on ML4SE Models
- URL: http://arxiv.org/abs/2206.08713v1
- Date: Fri, 17 Jun 2022 12:10:04 GMT
- Title: Evaluating the Impact of Source Code Parsers on ML4SE Models
- Authors: Ilya Utkin, Egor Spirin, Egor Bogomolov, Timofey Bryksin
- Abstract summary: We evaluate two models, namely Code2Seq and TreeLSTM, on the method name prediction task.
We show that trees built by different parsers vary in their structure and content.
We then analyze how this diversity affects the models' quality.
- Score: 3.699097874146491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As researchers and practitioners apply Machine Learning to increasingly more
software engineering problems, the approaches they use become more
sophisticated. A lot of modern approaches utilize internal code structure in
the form of an abstract syntax tree (AST) or its extensions: path-based
representation, complex graph combining AST with additional edges. Even though
the process of extracting ASTs from code can be done with different parsers,
the impact of choosing a parser on the final model quality remains unstudied.
Moreover, researchers often omit the exact details of extracting particular
code representations.
In this work, we evaluate two models, namely Code2Seq and TreeLSTM, in the
method name prediction task backed by eight different parsers for the Java
language. To unify the process of data preparation with different parsers, we
develop SuperParser, a multi-language parser-agnostic library based on
PathMiner. SuperParser facilitates the end-to-end creation of datasets suitable
for training and evaluation of ML models that work with structural information
from source code. Our results demonstrate that trees built by different parsers
vary in their structure and content. We then analyze how this diversity affects
the models' quality and show that the quality gap between the most and least
suitable parsers for both models turns out to be significant. Finally, we
discuss other features of the parsers that researchers and practitioners should
take into account when selecting a parser along with the impact on the models'
quality.
The code of SuperParser is publicly available at
https://doi.org/10.5281/zenodo.6366591. We also publish Java-norm, the dataset
we use to evaluate the models: https://doi.org/10.5281/zenodo.6366599.
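The path-based representation mentioned in the abstract (the input format Code2Seq consumes, which PathMiner and SuperParser produce) can be illustrated with a short sketch: extract root-to-leaf chains from an AST, then connect pairs of leaves through their lowest common ancestor. This sketch uses Python's stdlib `ast` parser in place of the eight Java parsers studied in the paper; the function name, the leaf-token choice (node type names), and the `^`-joined path format are illustrative assumptions, not SuperParser's actual API.

```python
import ast
from itertools import combinations

def leaf_paths(source: str, max_paths: int = 5):
    """Extract code2seq-style (leaf, path, leaf) triples from an AST."""
    tree = ast.parse(source)
    leaves = []  # root-to-leaf chains of AST nodes

    def walk(node, chain):
        chain = chain + [node]
        children = list(ast.iter_child_nodes(node))
        if not children:
            leaves.append(chain)
        for child in children:
            walk(child, chain)

    walk(tree, [])
    triples = []
    for c1, c2 in combinations(leaves, 2):
        # Depth of the lowest common ancestor = longest shared node prefix.
        i = 0
        while i < min(len(c1), len(c2)) and c1[i] is c2[i]:
            i += 1
        up = [type(n).__name__ for n in c1[i:]][::-1]   # leaf1 up to the LCA
        down = [type(n).__name__ for n in c2[i:]]       # LCA down to leaf2
        path = "^".join(up + [type(c1[i - 1]).__name__] + down)
        triples.append((up[0], path, down[-1]))
        if len(triples) >= max_paths:
            break
    return triples
```

On `def add(a, b): return a + b`, the first triple connects the two parameter leaves through their shared `arguments` parent: `('arg', 'arg^arguments^arg', 'arg')`. The paper's core observation applies directly here: a different parser would emit different node types and tree shapes, so these path strings, and hence the model's vocabulary and inputs, change with the parser.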
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - MRL Parsing Without Tears: The Case of Hebrew [14.104766026682384]
In morphologically rich languages (MRLs), where parsers need to identify multiple lexical units in each token, existing systems suffer in latency and setup complexity.
We present a new "flipped pipeline": decisions are made directly on the whole-token units by expert classifiers, each one dedicated to one specific task.
This blazingly fast approach sets a new SOTA in Hebrew POS tagging and dependency parsing, while also reaching near-SOTA performance on other Hebrew tasks.
arXiv Detail & Related papers (2024-03-11T17:54:33Z) - BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and
Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
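The constrained-decoding idea behind that result, restricting generation to outputs that are valid under a grammar, can be sketched without any ML machinery: at each step, mask out continuations the grammar forbids before taking the argmax. Everything below (the toy grammar, the token scores, and the function name) is invented for illustration and is not BenchCLAMP's API.

```python
def constrained_decode(scores, allowed_next, start="BOS", end="EOS"):
    """Greedy decoding where `allowed_next[state]` lists the legal tokens."""
    out, state = [], start
    while state != end:
        # Illegal tokens are ignored no matter how highly the unconstrained
        # model scores them; we only argmax over the grammar-legal set.
        state = max(allowed_next[state], key=lambda tok: scores.get(tok, 0.0))
        if state != end:
            out.append(state)
    return out

# A tiny SQL-ish grammar: each state maps to the tokens allowed to follow it.
grammar = {
    "BOS": ["SELECT"],
    "SELECT": ["name", "id"],
    "name": ["FROM"],
    "id": ["FROM"],
    "FROM": ["users"],
    "users": ["EOS"],
}
scores = {"SELECT": 0.9, "id": 0.8, "name": 0.3, "FROM": 0.7, "users": 0.95}
constrained_decode(scores, grammar)  # → ['SELECT', 'id', 'FROM', 'users']
```

Real systems apply the same masking to a language model's next-token logits, which is why even a pretrained model that has never seen the target formalism can be forced to emit only well-formed parses.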
arXiv Detail & Related papers (2022-06-21T18:34:11Z) - Unsupervised and Few-shot Parsing from Pretrained Language Models [56.33247845224995]
We propose an Unsupervised constituent Parsing model that calculates an Outside Association score solely based on the self-attention weight matrix learned in a pretrained language model.
We extend the unsupervised models to few-shot parsing models that use a few annotated trees to learn better linear projection matrices for parsing.
Our few-shot parsing model FPIO trained with only 20 annotated trees outperforms a previous few-shot parsing method trained with 50 annotated trees.
arXiv Detail & Related papers (2022-06-10T10:29:15Z) - Low-Resource Task-Oriented Semantic Parsing via Intrinsic Modeling [65.51280121472146]
We exploit what we intrinsically know about ontology labels to build efficient semantic parsing models.
Our model is highly efficient using a low-resource benchmark derived from TOPv2.
arXiv Detail & Related papers (2021-04-15T04:01:02Z) - Learning to Synthesize Data for Semantic Parsing [57.190817162674875]
We propose a generative model which models the composition of programs and maps a program to an utterance.
Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand.
We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider.
arXiv Detail & Related papers (2021-04-12T21:24:02Z) - Applying Occam's Razor to Transformer-Based Dependency Parsing: What
Works, What Doesn't, and What is Really Necessary [9.347252855045125]
We study the choice of pre-trained embeddings and whether to use LSTM layers in graph-based dependency parsers.
We propose a simple but widely applicable architecture and configuration, achieving new state-of-the-art results (in terms of LAS) for 10 out of 12 diverse languages.
arXiv Detail & Related papers (2020-10-23T22:58:26Z) - Towards Instance-Level Parser Selection for Cross-Lingual Transfer of
Dependency Parsers [59.345145623931636]
We argue for a novel cross-lingual transfer paradigm: instance-level parser selection (ILPS).
We present a proof-of-concept study focused on instance-level selection in the framework of delexicalized transfer.
arXiv Detail & Related papers (2020-04-16T13:18:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.