On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words
- URL: http://arxiv.org/abs/2505.18444v4
- Date: Thu, 24 Jul 2025 15:44:52 GMT
- Title: On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words
- Authors: Christian D. Newman, Anthony Peruma, Eman Abdullah AlOmar, Mahie Crabbe, Syreen Banabilah, Reem S. AlSuhaibani, Michael J. Decker, Farhad Akhbardeh, Marcos Zampieri, Mohamed Wiem Mkaouer, Jonathan I. Maletic
- Abstract summary: This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns. The specific focus is on closed syntactic categories, which are rarely studied in software engineering. The relationship between closed-category grammar patterns and program behavior is then analyzed using grounded-theory-inspired coding, statistical, and pattern analysis.
- Score: 19.94735883254009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identifier names are crucial components of code, serving as primary clues for developers to understand program behavior. This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns, which represent the part-of-speech (PoS) sequences underlying identifier phrases. The specific focus is on closed syntactic categories (e.g., prepositions, conjunctions, determiners), which are rarely studied in software engineering despite their central role in general natural language. To study these categories, the Closed Category Identifier Dataset (CCID), a new manually annotated dataset of 1,275 identifiers drawn from 30 open-source systems, is constructed and presented. The relationship between closed-category grammar patterns and program behavior is then analyzed using grounded-theory-inspired coding, statistical, and pattern analysis. The results reveal recurring structures that developers use to express concepts such as control flow, data transformation, temporal reasoning, and other behavioral roles through naming. This work contributes an empirical foundation for understanding how linguistic resources encode behavior in identifier names and supports new directions for research in naming, program comprehension, and education.
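To make the idea of a grammar pattern concrete, the following is a minimal sketch, not the paper's annotation procedure or tagger. It splits an identifier into words and labels each word as a preposition, conjunction, or determiner when it appears in a small hand-written lexicon. The tag labels (`P`, `CJ`, `DT`, `OPEN`), the word lists, and the function names are all assumptions made for illustration; the CCID dataset described in the abstract was annotated manually.

```python
import re

# Illustrative lexicon of closed-category words. The entries and the tag
# labels are assumptions for this sketch, not taken from the paper.
CLOSED_CATEGORY = {
    "P": {"to", "from", "in", "on", "with", "by", "for", "of", "at"},
    "CJ": {"and", "or", "if", "but", "while"},
    "DT": {"a", "an", "the", "all", "each", "this", "any"},
}


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (HTTP), capitalized or lowercase words, digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def grammar_pattern(name):
    """Return a space-separated tag sequence for the identifier.

    Words found in the lexicon get their closed-category tag; every other
    word is labeled OPEN (noun, verb, adjective, etc. are not distinguished).
    """
    tags = []
    for word in split_identifier(name):
        tag = next(
            (t for t, vocab in CLOSED_CATEGORY.items() if word in vocab),
            "OPEN",
        )
        tags.append(tag)
    return " ".join(tags)


if __name__ == "__main__":
    for ident in ["convertToString", "items_in_cache", "readAndValidate", "theBuffer"]:
        print(ident, "->", grammar_pattern(ident))
```

A lexicon lookup like this ignores context, so a word that can belong to more than one category is always given the same tag; it is meant only to show what a part-of-speech sequence over an identifier looks like.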
Related papers
- Identifier Name Similarities: An Exploratory Study [3.7420775485568294]
We present our preliminary findings on the occurrence of identifier name similarity in software projects. We envision our initial taxonomy providing researchers with a platform to analyze and evaluate the impact of identifier name similarity on code comprehension, maintainability, and collaboration among developers.
arXiv Detail & Related papers (2025-07-24T04:13:26Z)
- From Open-Vocabulary to Vocabulary-Free Semantic Segmentation [78.62232202171919]
Open-vocabulary semantic segmentation enables models to identify novel object categories beyond their training data. Current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications. This work proposes a Vocabulary-Free Semantic pipeline, eliminating the need for predefined class vocabularies.
arXiv Detail & Related papers (2025-02-17T15:17:08Z)
- How Important Is Tokenization in French Medical Masked Language Models? [7.866517623371908]
Subword tokenization has become the prevailing standard in the field of natural language processing (NLP).
This paper seeks to delve into the complexities of subword tokenization in the French biomedical domain across a variety of NLP tasks.
We introduce an original tokenization strategy that integrates morpheme-enriched word segmentation into existing tokenization methods.
arXiv Detail & Related papers (2024-02-22T23:11:08Z)
- Language Models As Semantic Indexers [78.83425357657026]
We introduce LMIndexer, a self-supervised framework to learn semantic IDs with a generative language model.
We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks including recommendation, product search, and document retrieval.
arXiv Detail & Related papers (2023-10-11T18:56:15Z)
- Assessment of Pre-Trained Models Across Languages and Grammars [7.466159270333272]
We aim to recover constituent and dependency structures by casting parsing as sequence labeling.
Our results show that pre-trained word vectors do not favor constituency representations of syntax over dependencies.
Occurrence of a language in the pretraining data is more important than the amount of task data when recovering syntax from the word vectors.
arXiv Detail & Related papers (2023-09-20T09:23:36Z)
- Multiview Identifiers Enhanced Generative Retrieval [78.38443356800848]
Generative retrieval generates identifier strings of passages as the retrieval target.
We propose a new type of identifier, synthetic identifiers, that are generated based on the content of a passage.
Our proposed approach performs the best in generative retrieval, demonstrating its effectiveness and robustness.
arXiv Detail & Related papers (2023-05-26T06:50:21Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings and reasoning mechanisms is a significant challenge. We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating lengthy sentences. We demonstrate that generative models like GPT can accurately learn and reason over CFG-defined hierarchies and generate sentences based on them.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- Disambiguation of Company names via Deep Recurrent Networks [101.90357454833845]
We propose a Siamese LSTM Network approach to extract -- via supervised learning -- an embedding of company name strings.
We analyse how an Active Learning approach to prioritise the samples to be labelled leads to a more efficient overall learning pipeline.
arXiv Detail & Related papers (2023-03-07T15:07:57Z)
- Model Choices Influence Attributive Word Associations: A Semi-supervised Analysis of Static Word Embeddings [0.0]
This work aims to assess attributive word associations across five different static word embedding architectures.
Our results reveal that the choice of the context learning flavor during embedding training (CBOW vs skip-gram) impacts the word association distinguishability and word embeddings' sensitivity to deviations in the training corpora.
arXiv Detail & Related papers (2020-12-14T22:27:18Z)
- A Self-supervised Representation Learning of Sentence Structure for Authorship Attribution [3.5991811164452923]
We propose a self-supervised framework for learning structural representations of sentences.
We evaluate the learned structural representations of sentences using different probing tasks, and subsequently utilize them in the authorship attribution task.
arXiv Detail & Related papers (2020-10-14T02:57:10Z)
- OCoR: An Overlapping-Aware Code Retriever [15.531119719750807]
Given a natural language description, code retrieval aims to search for the most relevant code among a set of code.
Existing state-of-the-art approaches apply neural networks to code retrieval.
We propose a novel neural architecture named OCoR, where we introduce two specifically-designed components to capture overlaps.
arXiv Detail & Related papers (2020-08-12T09:43:35Z)
- Interpretability Analysis for Named Entity Recognition to Understand System Predictions and How They Can Improve [49.878051587667244]
We examine the performance of several variants of LSTM-CRF architectures for named entity recognition.
We find that context representations do contribute to system performance, but that the main factor driving high performance is learning the name tokens themselves.
We enlist human annotators to evaluate the feasibility of inferring entity types from the context alone and find that, while people are not able to infer the entity type either for the majority of the errors made by the context-only system, there is some room for improvement.
arXiv Detail & Related papers (2020-04-09T14:37:12Z)