Identification of the Relevance of Comments in Codes Using Bag of Words and Transformer Based Models
- URL: http://arxiv.org/abs/2308.06144v1
- Date: Fri, 11 Aug 2023 14:06:41 GMT
- Title: Identification of the Relevance of Comments in Codes Using Bag of Words and Transformer Based Models
- Authors: Sruthi S, Tanmay Basu
- Abstract summary: The paper presents an overview of the models and other significant findings on the training corpus.
The performance of these models on the training corpus was reported and the best five models were applied to the given test corpus.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Forum for Information Retrieval Evaluation (FIRE) started a shared task
this year for the classification of comments of different code segments. This is a
binary text classification task where the objective is to identify whether the
comments given for certain code segments are relevant or not. The BioNLP-IISERB
group at the Indian Institute of Science Education and Research Bhopal (IISERB)
participated in this task and submitted five runs for five different models. The
paper presents an overview of the models and other significant findings on the
training corpus. The methods involve different feature engineering schemes and
text classification techniques. The performance of the classical bag of words
model and of transformer-based models was explored to identify significant
features from the given training corpus. We explored different classifiers,
viz., random forest, support vector machine and logistic regression, using the
bag of words model. Furthermore, pre-trained transformer-based models such as
BERT, RoBERTa and ALBERT were also used by fine-tuning them on the given
training corpus. The performance of these models on the training corpus was
reported, and the best five models were applied to the given test corpus. The
empirical results show that the bag of words model outperforms the
transformer-based models; however, the performance of our runs is not
reasonably good on either the training or the test corpus. This paper also
addresses the limitations of the models and the scope for further improvement.
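The abstract gives no implementation details, but the bag of words approach it describes can be illustrated with a minimal sketch. The example below uses scikit-learn with TF-IDF features and the three classifiers named in the abstract (random forest, support vector machine, logistic regression); the file name, column names and hyperparameters are assumptions for illustration, not the authors' actual settings.

```python
# Hypothetical sketch of the bag of words runs: TF-IDF features over the
# comment and code text, evaluated with the three classical classifiers
# mentioned in the abstract. File and column names are assumed.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

train = pd.read_csv("train.csv")                    # assumed file name
texts = train["comment"] + " " + train["code"]      # assumed column names
labels = train["label"]                             # 1 = relevant, 0 = not relevant

# Bag of words / TF-IDF features over word unigrams and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X = vectorizer.fit_transform(texts)

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC(),
    "random_forest": RandomForestClassifier(n_estimators=300),
}

# Cross-validated F1 on the training corpus, roughly mirroring how runs
# would be compared before selecting models for the test corpus.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, labels, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```

Sparse linear models of this kind are standard baselines for short-text classification, which is consistent with the abstract's finding that the bag of words runs outperformed the transformer runs.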
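A similar sketch, under the same caveats, of fine-tuning one of the pre-trained transformers (BERT here; RoBERTa or ALBERT work the same way via their checkpoint names) with the Hugging Face Transformers Trainer API. The data files, column names and hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Hypothetical sketch of a transformer fine-tuning run for the binary
# comment-relevance task. CSV files with "text" and "label" columns are assumed.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"   # or "roberta-base", "albert-base-v2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    # Truncate long comment/code pairs to a fixed length.
    return tokenizer(batch["text"], truncation=True, max_length=256)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="comment-relevance",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,   # enables dynamic padding via the default collator
)
trainer.train()
predictions = trainer.predict(data["test"])
```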
Related papers
- llm-jp-modernbert: A ModernBERT Model Trained on a Large-Scale Japanese Corpus with Long Context Length [1.5857828218932415]
We present llm-jp-modernbert, a ModernBERT model trained on a publicly available, massive Japanese corpus with a context length of 8192 tokens.
While our model does not surpass existing baselines on downstream tasks, it achieves good results on fill-mask test evaluations.
arXiv Detail & Related papers (2025-04-22T02:45:19Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Extensive Evaluation of Transformer-based Architectures for Adverse Drug Events Extraction [6.78974856327994]
Adverse Event (ADE) extraction is one of the core tasks in digital pharmacovigilance.
We evaluate 19 Transformer-based models for ADE extraction on informal texts.
At the end of our analyses, we identify a list of take-home messages that can be derived from the experimental data.
arXiv Detail & Related papers (2023-06-08T15:25:24Z)
- On Robustness of Finetuned Transformer-based NLP Models [11.063628128069736]
We characterize changes between pretrained and finetuned language model representations across layers using two metrics: CKA and STIR.
GPT-2 representations are more robust than those of BERT and T5 across multiple types of input perturbations.
This study provides valuable insights into perturbation-specific weaknesses of popular Transformer-based models.
arXiv Detail & Related papers (2023-05-23T18:25:18Z)
- Transformer-based approaches to Sentiment Detection [55.41644538483948]
We examined the performance of four different types of state-of-the-art transformer models for text classification.
The RoBERTa transformer model performs best on the test dataset with a score of 82.6% and is highly recommended for quality predictions.
arXiv Detail & Related papers (2023-03-13T17:12:03Z)
- Artificial Interrogation for Attributing Language Models [0.0]
The challenge provides twelve open-sourced base versions of popular language models and twelve fine-tuned language models for text generation.
The goal of the contest is to identify which fine-tuned models originated from which base model.
We have employed four distinct approaches for measuring the resemblance between the responses generated from the models of both sets.
arXiv Detail & Related papers (2022-11-20T05:46:29Z)
- Masked Autoencoders As The Unified Learners For Pre-Trained Sentence Representation [77.47617360812023]
We extend the recently proposed MAE style pre-training strategy, RetroMAE, to support a wide variety of sentence representation tasks.
The first stage performs RetroMAE over generic corpora, like Wikipedia, BookCorpus, etc., from which the base model is learned.
The second stage takes place on domain-specific data, e.g., MS MARCO and NLI, where the base model is continually trained based on RetroMAE and contrastive learning.
arXiv Detail & Related papers (2022-07-30T14:34:55Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- Interpreting Language Models Through Knowledge Graph Extraction [42.97929497661778]
We compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process.
We present a methodology to unveil a knowledge acquisition timeline by generating knowledge graph extracts from cloze "fill-in-the-blank" statements.
We extend this analysis to a comparison of pretrained variations of BERT models (DistilBERT, BERT-base, RoBERTa).
arXiv Detail & Related papers (2021-11-16T15:18:01Z)
- Explanation-Guided Training for Cross-Domain Few-Shot Classification [96.12873073444091]
The cross-domain few-shot classification task (CD-FSC) combines few-shot classification with the requirement to generalize across domains represented by datasets.
We introduce a novel training approach for existing FSC models.
We show that explanation-guided training effectively improves the model generalization.
arXiv Detail & Related papers (2020-07-17T07:28:08Z)
- Document Ranking with a Pretrained Sequence-to-Sequence Model [56.44269917346376]
We show how a sequence-to-sequence model can be trained to generate relevance labels as "target words".
Our approach significantly outperforms an encoder-only model in a data-poor regime.
arXiv Detail & Related papers (2020-03-14T22:29:50Z)
- Gradient-Based Adversarial Training on Transformer Networks for Detecting Check-Worthy Factual Claims [3.7543966923106438]
We introduce the first adversarially-regularized, transformer-based claim spotter model.
We obtain a 4.70 point F1-score improvement over current state-of-the-art models.
We propose a method to apply adversarial training to transformer models.
arXiv Detail & Related papers (2020-02-18T16:51:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.