Related papers: LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation

LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation

URL: http://arxiv.org/abs/2510.22210v1
Date: Sat, 25 Oct 2025 08:19:21 GMT
Title: LSPRAG: LSP-Guided RAG for Language-Agnostic Real-Time Unit Test Generation
Authors: Gwihwan Go, Quan Zhang, Chijin Zhou, Zhao Wei, Yu Jiang,
Abstract summary: We present LSPRAG, a framework for concise-context retrieval tailored for real-time, language-agnostic unit test generation.<n>By reusing mature Language Server Protocol (LSP) back-ends, LSPRAG provides an LLM with language-aware context retrieval.<n>Compared to the best performance of baselines, LSPRAG increased line coverage by up to 174.55% for Golang, 213.31% for Java, and 31.57% for Python.
Score: 19.781961858094398
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated unit test generation is essential for robust software development, yet existing approaches struggle to generalize across multiple programming languages and operate within real-time development. While Large Language Models (LLMs) offer a promising solution, their ability to generate high coverage test code depends on prompting a concise context of the focal method. Current solutions, such as Retrieval-Augmented Generation, either rely on imprecise similarity-based searches or demand the creation of costly, language-specific static analysis pipelines. To address this gap, we present LSPRAG, a framework for concise-context retrieval tailored for real-time, language-agnostic unit test generation. LSPRAG leverages off-the-shelf Language Server Protocol (LSP) back-ends to supply LLMs with precise symbol definitions and references in real time. By reusing mature LSP servers, LSPRAG provides an LLM with language-aware context retrieval, requiring minimal per-language engineering effort. We evaluated LSPRAG on open-source projects spanning Java, Go, and Python. Compared to the best performance of baselines, LSPRAG increased line coverage by up to 174.55% for Golang, 213.31% for Java, and 31.57% for Python.

Related papers

Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks [22.908904483320953]
Large Language Models (LLMs) in coding tasks are often a reflection of their extensive pre-training corpora.<n>We propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives.<n>We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks.
arXiv Detail & Related papers (2026-01-16T09:06:47Z)
A Multi-Language Object-Oriented Programming Benchmark for Large Language Models [61.267115598083315]
A survey of 35 existing benchmarks uncovers three major imbalances.<n>85.7% focus on a single programming language.<n>94.3% target only function-level or statement-level tasks.<n>Over 80% include fewer than ten test cases on average.
arXiv Detail & Related papers (2025-09-30T11:30:08Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs)<n>Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability.<n>Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language [26.88208349402451]
We propose MUG-Eval, a novel framework that evaluates large language models' multilingual generation capabilities.<n>We transform existing benchmarks into conversational tasks and measure the LLMs' accuracies on those tasks.<n>We evaluate 8 LLMs across 30 languages spanning high, mid, and low-resource categories, and we find that MUG-Eval correlates strongly with established benchmarks.
arXiv Detail & Related papers (2025-05-20T14:14:00Z)
Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation [1.7268889851975326]
We introduce WebApp1K, a novel benchmark for evaluating large language models (LLMs) in test-driven development (TDD) tasks.<n>Unlike traditional approaches relying on natural language prompts, our benchmark emphasizes the ability of LLMs to interpret and implement functionality directly from test cases.
arXiv Detail & Related papers (2025-05-13T23:47:12Z)
mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation [28.531581489405745]
mHumanEval is an extended benchmark supporting prompts in over 200 natural languages.<n>We provide expert human translations for 15 diverse natural languages (NLs)<n>We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs.
arXiv Detail & Related papers (2024-10-19T08:44:26Z)
An Empirical Study of Large Language Models for Type and Call Graph Analysis in Python and JavaScript [3.385461018649221]
Large Language Models (LLMs) are increasingly being explored for their potential in software engineering.<n>We investigate the potential of current LLMs to enhance call-graph analysis and type inference for Python and JavaScript programs.<n>We empirically evaluated 24 LLMs, including OpenAI's GPT series and open-source models like LLaMA and Mistral.
arXiv Detail & Related papers (2024-10-01T11:44:29Z)
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.1875460416205]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.<n>It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.<n>Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally. Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
Efficient Spoken Language Recognition via Multilabel Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods. Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
arXiv Detail & Related papers (2023-06-02T23:04:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.