Towards Leveraging Large Language Model Summaries for Topic Modeling in Source Code
- URL: http://arxiv.org/abs/2504.17426v1
- Date: Thu, 24 Apr 2025 10:30:40 GMT
- Title: Towards Leveraging Large Language Model Summaries for Topic Modeling in Source Code
- Authors: Michele Carissimi, Martina Saletta, Claudio Ferretti
- Abstract summary: Large language models (LLMs) have demonstrated remarkable program comprehension capabilities, while transformer-based topic modeling techniques offer effective ways to extract semantic information from text. This paper proposes and explores a novel approach that combines these strengths to automatically identify meaningful topics in a corpus of Python programs.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding source code is a topic of great interest in the software engineering community, since it can help programmers in various tasks such as software maintenance and reuse. Recent advances in large language models (LLMs) have demonstrated remarkable program comprehension capabilities, while transformer-based topic modeling techniques offer effective ways to extract semantic information from text. This paper proposes and explores a novel approach that combines these strengths to automatically identify meaningful topics in a corpus of Python programs. Our method consists of applying topic modeling to the descriptions obtained by asking an LLM to summarize the code. To assess the internal consistency of the extracted topics, we compare them against topics inferred from function names alone, and those derived from existing docstrings. Experimental results suggest that leveraging LLM-generated summaries provides interpretable and semantically rich representations of code structure. The promising results suggest that our approach can be fruitfully applied in various software engineering tasks such as automatic documentation and tagging, code search, software reorganization and knowledge discovery in large repositories.
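To make the pipeline concrete, the sketch below shows one way to implement it. The `summarize_with_llm` stub is a hypothetical placeholder for whichever LLM is used, and BERTopic is assumed as the transformer-based topic model; the abstract does not name a specific library, so this is an illustration under those assumptions, not the authors' exact setup.

```python
# Sketch of the pipeline: summarize Python code with an LLM, then fit a
# transformer-based topic model on the summaries instead of the raw code.
# Assumes `pip install bertopic`; the summarizer is a hypothetical stub.
from bertopic import BERTopic


def summarize_with_llm(source: str) -> str:
    """Hypothetical placeholder: prompt an LLM to describe the code
    in natural language. Replace with an actual client call."""
    raise NotImplementedError("plug in an LLM client here")


def discover_code_topics(snippets: list[str]) -> BERTopic:
    # 1. Turn each Python snippet into a natural-language description.
    summaries = [summarize_with_llm(code) for code in snippets]
    # 2. Model topics over the descriptions rather than the raw code.
    model = BERTopic(language="english")
    model.fit_transform(summaries)
    return model


# Usage, given a corpus of Python functions as strings:
#   model = discover_code_topics(corpus)
#   print(model.get_topic_info())  # one row per discovered topic
```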
Related papers
- Bridging Textual-Collaborative Gap through Semantic Codes for Sequential Recommendation [91.13055384151897]
CoCoRec is a novel Code-based textual and Collaborative semantic fusion method for sequential Recommendation. We generate fine-grained semantic codes from multi-view text embeddings through vector quantization techniques. To further enhance the fusion of textual and collaborative semantics, we introduce an optimization strategy.
arXiv Detail & Related papers (2025-03-15T15:54:44Z)
- Analysis on LLMs Performance for Code Summarization [0.0]
Large Language Models (LLMs) have significantly advanced the field of code summarization. This study aims to perform a comparative analysis of several open-source LLMs, namely LLaMA-3, Phi-3, Mistral, and Gemma.
arXiv Detail & Related papers (2024-12-22T17:09:34Z)
- Systematic Task Exploration with LLMs: A Study in Citation Text Generation [63.50597360948099]
Large language models (LLMs) bring unprecedented flexibility in defining and executing complex, creative natural language generation (NLG) tasks.
We propose a three-component research framework that consists of systematic input manipulation, reference data, and output measurement.
We use this framework to explore citation text generation -- a popular scholarly NLP task that lacks consensus on the task definition and evaluation metric.
arXiv Detail & Related papers (2024-07-04T16:41:08Z)
- LILO: Learning Interpretable Libraries by Compressing and Documenting Code [71.55208585024198]
We introduce LILO, a neurosymbolic framework that iteratively synthesizes, compresses, and documents code.
LILO combines LLM-guided program synthesis with recent algorithmic advances in automated refactoring from Stitch.
We find that AutoDoc, LILO's automatic documentation procedure, boosts performance by helping the synthesizer interpret and deploy learned abstractions.
arXiv Detail & Related papers (2023-10-30T17:55:02Z)
- Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z)
- Understanding Code Semantics: An Evaluation of Transformer Models in Summarization [0.0]
We evaluate the efficacy of code summarization by altering function and variable names.
We introduce adversaries like dead code and commented code across three programming languages.
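Such perturbations are straightforward to script over the syntax tree; the following Python sketch (an illustrative implementation, not necessarily the authors' exact procedure) renames identifiers and appends dead code before re-running a summarizer on the mutated source.

```python
# Illustrative adversaries for code summarization: rename identifiers and
# append unreachable (dead) code, then compare summaries of the original
# and mutated sources. Requires Python 3.9+ for ast.unparse.
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Replace function and argument names with opaque placeholders."""

    def __init__(self) -> None:
        self.mapping: dict[str, str] = {}

    def _alias(self, name: str) -> str:
        return self.mapping.setdefault(name, f"v{len(self.mapping)}")

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.AST:
        node.name = self._alias(node.name)
        for arg in node.args.args:
            arg.arg = self._alias(arg.arg)
        return self.generic_visit(node)  # also rewrites uses in the body

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = self.mapping.get(node.id, node.id)
        return node


def add_dead_code(source: str) -> str:
    """Append an unreachable branch as a simple dead-code adversary."""
    return source + "\nif False:\n    _unused = 0\n"


src = "def area(width, height):\n    return width * height\n"
print(ast.unparse(RenameIdentifiers().visit(ast.parse(src))))
# -> def v0(v1, v2): ... with v1 * v2 in the body (semantics preserved)
print(add_dead_code(src))
```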
arXiv Detail & Related papers (2023-10-25T02:41:50Z)
- Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z)
- Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations [35.74225306947918]
We propose a joint latent space learning and clustering framework built upon PLM embeddings.
Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery.
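In its simplest form the idea looks like the sketch below; note that the paper couples latent-space learning with clustering, whereas this only shows the encode-then-cluster baseline that the joint framework improves on, and the model name and libraries are assumptions for illustration.

```python
# Encode documents with a pretrained language model and cluster the
# embeddings; each cluster is a candidate topic. This is only the
# encode-then-cluster baseline, not the joint framework itself.
# Assumes `pip install sentence-transformers scikit-learn`.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

docs = [
    "parse a configuration file and return a dict",
    "load settings from disk into a dictionary",
    "train a neural network on image batches",
]

embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print(labels)  # the two config-loading docs should share a cluster
```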
arXiv Detail & Related papers (2022-02-09T17:26:08Z)
- Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task.
We propose to use document similarity methods to create noisy parallel datasets of code.
We show that translation models trained on these noisy datasets perform comparably to models trained on ground truth for reasonable levels of noise.
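As a minimal illustration of the mining step, the sketch below pairs files across languages by similarity; TF-IDF cosine similarity is just one of the document-similarity methods one could plug in here, and the toy file lists are assumptions for the example.

```python
# Pair each source-language file with its most similar target-language
# file to build a noisy parallel dataset. TF-IDF is one simple choice
# of document-similarity method. Assumes `pip install scikit-learn`.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

java_files = ["int add(int a, int b) { return a + b; }"]
python_files = ["def add(a, b):\n    return a + b",
                "def greet():\n    print('hi')"]

vec = TfidfVectorizer(token_pattern=r"\w+")
matrix = vec.fit_transform(java_files + python_files)
sims = cosine_similarity(matrix[:len(java_files)], matrix[len(java_files):])

for i, row in enumerate(sims):  # keep the best match as a noisy pair
    print(java_files[i], "<->", python_files[row.argmax()])
```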
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
- Leveraging Language to Learn Program Abstractions and Search Heuristics [66.28391181268645]
We introduce LAPS (Language for Abstraction and Program Search), a technique for using natural language annotations to guide joint learning of libraries and neurally-guided search models for synthesis.
When integrated into a state-of-the-art library learning system (DreamCoder), LAPS produces higher-quality libraries and improves search efficiency and generalization.
arXiv Detail & Related papers (2021-06-18T15:08:47Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)