A Comprehensive Study on the Use of Word Embedding Models in Software Engineering Domain
- URL: http://arxiv.org/abs/2505.17634v1
- Date: Fri, 23 May 2025 08:52:29 GMT
- Title: A Comprehensive Study on the Use of Word Embedding Models in Software Engineering Domain
- Authors: Xiaohan Chen, Weiqin Zou, Lianyi Zhi, Qianshuang Meng, Jingxuan Zhang,
- Abstract summary: This study focuses on the use of word embedding (WE) models in the software engineering (SE) domain. 181 primary studies published in mainstream software engineering venues are collected for analysis. We get a systematic view of the current practice of using WE for the SE domain, and figure out the challenges and actions in adopting or developing practical semantic representation approaches for the SE artifacts used in a series of SE tasks.
- Score: 16.40945129377773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Word embedding (WE) techniques are advanced textual semantic representation models originating from the natural language processing (NLP) area. Inspired by their effectiveness in facilitating various NLP tasks, more and more researchers attempt to adopt these WE models for their software engineering (SE) tasks, in which semantic representation of software artifacts such as bug reports and code snippets is the basis for further model building. However, existing studies are generally isolated from each other, without comprehensive comparison and discussion. This not only leaves the best practices of such cross-discipline technique adoption buried in scattered papers, but also leaves us largely blind to current progress in the semantic representation of SE artifacts. To this end, we decided to perform a comprehensive study on the use of WE models in the SE domain. 181 primary studies published in mainstream software engineering venues are collected for analysis. Several research questions related to the SE applications, the training strategy of WE models, the comparison with traditional semantic representation methods, etc., are answered. With the answers, we get a systematic view of the current practice of using WE for the SE domain, and figure out the challenges and actions in adopting or developing practical semantic representation approaches for the SE artifacts used in a series of SE tasks.
Related papers
- Vision Generalist Model: A Survey [87.49797517847132]
We provide a comprehensive overview of the vision generalist models, delving into their characteristics and capabilities within the field. We take a brief excursion into related domains, shedding light on their interconnections and potential synergies.
arXiv Detail & Related papers (2025-06-11T17:23:41Z) - Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z) - Enhancing Software-Related Information Extraction via Single-Choice Question Answering with Large Language Models [3.6637903428898055]
This paper focuses on improving relation extraction in scholarly texts through generative Large Language Models.
The methodology prioritises the use of in-context learning capabilities of GLMs to extract software-related entities.
Our participation in the SOMD shared task highlights the importance of precise software citation practices.
arXiv Detail & Related papers (2024-04-08T15:00:36Z) - Large Language Models for Generative Information Extraction: A Survey [89.71273968283616]
Large Language Models (LLMs) have demonstrated remarkable capabilities in text understanding and generation.
We present an extensive overview by categorizing these works in terms of various IE subtasks and techniques.
We empirically analyze the most advanced methods and discover the emerging trend of IE tasks with LLMs.
arXiv Detail & Related papers (2023-12-29T14:25:22Z) - Combatting Human Trafficking in the Cyberspace: A Natural Language Processing-Based Methodology to Analyze the Language in Online Advertisements [55.2480439325792]
This project tackles the pressing issue of human trafficking in online C2C marketplaces through advanced Natural Language Processing (NLP) techniques.
We introduce a novel methodology for generating pseudo-labeled datasets with minimal supervision, serving as a rich resource for training state-of-the-art NLP models.
A key contribution is the implementation of an interpretability framework using Integrated Gradients, providing explainable insights crucial for law enforcement.
arXiv Detail & Related papers (2023-11-22T02:45:01Z) - A Survey on Semantic Processing Techniques [38.32578417623237]
The study of semantics is multi-dimensional in linguistics.
The research depth and breadth of computational semantic processing can be largely improved with new technologies.
arXiv Detail & Related papers (2023-10-22T15:09:51Z) - Foundation Models for Decision Making: Problems, Methods, and Opportunities [124.79381732197649]
Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks.
New paradigms are emerging for training foundation models to interact with other agents and perform long-term reasoning.
Research at the intersection of foundation models and decision making holds tremendous promise for creating powerful new systems.
arXiv Detail & Related papers (2023-03-07T18:44:07Z) - The Use of NLP-Based Text Representation Techniques to Support Requirement Engineering Tasks: A Systematic Mapping Review [1.5469452301122177]
The research direction has changed from the use of lexical and syntactic features to the use of advanced embedding techniques.
We identify four gaps in the existing literature, why they matter, and how future research can begin to address them.
arXiv Detail & Related papers (2022-05-17T02:47:26Z) - Knowledge-Aware Procedural Text Understanding with Multi-Stage Training [110.93934567725826]
We focus on the task of procedural text understanding, which aims to comprehend such documents and track entities' states and locations during a process.
Two challenges, the difficulty of commonsense reasoning and data insufficiency, still remain unsolved.
We propose a novel KnOwledge-Aware proceduraL text understAnding (KOALA) model, which effectively leverages multiple forms of external knowledge.
arXiv Detail & Related papers (2020-09-28T10:28:40Z) - A Systematic Literature Review on the Use of Deep Learning in Software Engineering Research [22.21817722054742]
An increasingly popular set of techniques adopted by software engineering (SE) researchers to automate development tasks are those rooted in the concept of Deep Learning (DL).
This paper presents a systematic literature review of research at the intersection of SE & DL.
We center our analysis around the components of learning, a set of principles that govern the application of machine learning techniques to a given problem domain.
arXiv Detail & Related papers (2020-09-14T15:28:28Z) - Distributional semantic modeling: a revised technique to train term/word vector space models applying the ontology-related approach [36.248702416150124]
We design a new technique for distributional semantic modeling with a neural network-based approach to learn distributed term representations (or term embeddings).
Vec2graph is a Python library for visualizing word embeddings (term embeddings in our case) as dynamic and interactive graphs.
arXiv Detail & Related papers (2020-03-06T18:27:39Z) - How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performances.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.