Seamlessly Integrating Tree-Based Positional Embeddings into Transformer Models for Source Code Representation
- URL: http://arxiv.org/abs/2507.04003v1
- Date: Sat, 05 Jul 2025 11:07:47 GMT
- Title: Seamlessly Integrating Tree-Based Positional Embeddings into Transformer Models for Source Code Representation
- Authors: Patryk Bartkowiak, Filip Graliński
- Abstract summary: We propose a novel tree-based positional embedding approach that explicitly encodes hierarchical relationships derived from Abstract Syntax Trees (ASTs). These hierarchical embeddings are integrated into the transformer architecture, specifically enhancing the CodeBERTa model. Experimental results indicate that our Tree-Enhanced CodeBERTa consistently surpasses the baseline model in terms of loss, accuracy, F1 score, precision, and recall.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based models have demonstrated significant success in various source code representation tasks. Nonetheless, traditional positional embeddings employed by these models inadequately capture the hierarchical structure intrinsic to source code, typically represented as Abstract Syntax Trees (ASTs). To address this, we propose a novel tree-based positional embedding approach that explicitly encodes hierarchical relationships derived from ASTs, including node depth and sibling indices. These hierarchical embeddings are integrated into the transformer architecture, specifically enhancing the CodeBERTa model. We thoroughly evaluate our proposed model through masked language modeling (MLM) pretraining and clone detection fine-tuning tasks. Experimental results indicate that our Tree-Enhanced CodeBERTa consistently surpasses the baseline model in terms of loss, accuracy, F1 score, precision, and recall, emphasizing the importance of incorporating explicit structural information into transformer-based representations of source code.
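To make the proposed mechanism concrete, the following is a minimal sketch of how depth and sibling-index features could be derived from an AST and injected into a transformer's input embeddings. It assumes a PyTorch setting and Python sources parsed with the standard `ast` module; the names `tree_positions` and `TreePositionalEmbedding`, the vocabulary sizes `max_depth` and `max_siblings`, and the additive combination with token embeddings are illustrative assumptions, not details confirmed by the paper.

```python
import ast

import torch
import torch.nn as nn


def tree_positions(source: str):
    """Walk a Python AST and assign each node a (depth, sibling index) pair."""
    root = ast.parse(source)
    positions = [(root, 0, 0)]  # (node, depth, sibling_index)

    def visit(node: ast.AST, depth: int) -> None:
        for i, child in enumerate(ast.iter_child_nodes(node)):
            positions.append((child, depth + 1, i))
            visit(child, depth + 1)

    visit(root, 0)
    return positions


class TreePositionalEmbedding(nn.Module):
    """Add learned depth and sibling-index embeddings to token embeddings."""

    def __init__(self, hidden_size: int, max_depth: int = 64, max_siblings: int = 256):
        super().__init__()
        self.depth_emb = nn.Embedding(max_depth, hidden_size)
        self.sibling_emb = nn.Embedding(max_siblings, hidden_size)
        self.max_depth = max_depth
        self.max_siblings = max_siblings

    def forward(self, token_emb: torch.Tensor, depths: torch.Tensor,
                siblings: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, hidden_size)
        # depths, siblings: (batch, seq_len) integer tensors aligned with the
        # token sequence; clamp out-of-range structural indices to the last bucket.
        depths = depths.clamp(max=self.max_depth - 1)
        siblings = siblings.clamp(max=self.max_siblings - 1)
        return token_emb + self.depth_emb(depths) + self.sibling_emb(siblings)
```

In a CodeBERTa-style model, the resulting embeddings would be added in the embedding layer before self-attention, alongside or instead of the usual sequential position embeddings. Aligning AST nodes with subword tokens is the nontrivial step and is deliberately left out of this sketch.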
Related papers
- Hierarchical Quantized Diffusion Based Tree Generation Method for Hierarchical Representation and Lineage Analysis [49.00783841494125]
HDTree captures tree relationships within a hierarchical latent space using a unified hierarchical codebook and quantized diffusion processes. HDTree's effectiveness is demonstrated through comparisons on both general-purpose and single-cell datasets. These contributions provide a new tool for hierarchical lineage analysis, enabling more accurate and efficient modeling of cellular differentiation paths.
arXiv Detail & Related papers (2025-06-29T15:19:13Z)
- Analysing the Behaviour of Tree-Based Neural Networks in Regression Tasks [3.912345988363511]
This paper endeavours to decode the behaviour of tree-based neural network models in the context of regression challenges.
We extend established models, namely tree-based CNNs, Code2Vec, and Transformer-based methods, to predict the execution time of source code by parsing it into an AST.
Our proposed dual transformer demonstrates remarkable adaptability and robust performance across diverse datasets.
arXiv Detail & Related papers (2024-06-17T11:47:14Z)
- Transformer Architecture for NetsDB [0.0]
We create an end-to-end implementation of a transformer for deep learning model serving in NetsDB.
We load the weights from our model for distributed processing, deployment, and efficient inferencing.
arXiv Detail & Related papers (2024-05-08T04:38:36Z)
- Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
- Implant Global and Local Hierarchy Information to Sequence based Code Representation Models [25.776540440893257]
We analyze how the complete hierarchical structure influences the tokens in code sequences and abstract this influence as a property of code tokens called hierarchical embedding.
We propose the Hierarchy Transformer (HiT), a simple but effective sequence model to incorporate the complete hierarchical embeddings of source code into a Transformer model.
arXiv Detail & Related papers (2023-03-14T12:01:39Z)
- SIT3: Code Summarization with Structure-Induced Transformer [48.000063280183376]
We propose a novel model based on structure-induced self-attention, which encodes sequential inputs with highly effective structure modeling.
Our newly-proposed model achieves new state-of-the-art results on popular benchmarks.
arXiv Detail & Related papers (2020-12-29T11:37:43Z)
- GRIT: Generative Role-filler Transformers for Document-level Event Entity Extraction [134.5580003327839]
We introduce a generative transformer-based encoder-decoder framework (GRIT) to model context at the document level.
We evaluate our approach on the MUC-4 dataset, and show that our model performs substantially better than prior work.
arXiv Detail & Related papers (2020-08-21T01:07:36Z)
- Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z)
- Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)