The (ab)use of Open Source Code to Train Large Language Models
- URL: http://arxiv.org/abs/2302.13681v2
- Date: Tue, 28 Feb 2023 10:47:48 GMT
- Title: The (ab)use of Open Source Code to Train Large Language Models
- Authors: Ali Al-Kaswan and Maliheh Izadi
- Abstract summary: We discuss the security, privacy, and licensing implications of memorization.
We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma.
- Score: 0.8122270502556374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Large Language Models (LLMs) have gained significant
popularity due to their ability to generate human-like text and their potential
applications in various fields, such as Software Engineering. LLMs for Code are
commonly trained on large unsanitized corpora of source code scraped from the
Internet. The content of these datasets is memorized and emitted by the models,
often in a verbatim manner. In this work, we will discuss the security,
privacy, and licensing implications of memorization. We argue why the use of
copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide
four actionable recommendations to address this issue.
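The claim that training data is "memorized and emitted ... often in a verbatim manner" can be illustrated with a simple check. The sketch below is not the paper's methodology; it is a minimal, hypothetical detector that indexes fixed-length token n-grams from a training corpus and flags any output span that reproduces one verbatim, a common building block in memorization studies.

```python
# Illustrative sketch (not the paper's method): flag model outputs that
# reproduce long verbatim spans from a training corpus, by indexing
# fixed-length token n-grams of the corpus and scanning the output.

def ngrams(tokens, n):
    """Yield all contiguous n-token windows as tuples."""
    return (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def verbatim_spans(output_text, corpus_texts, n=8):
    """Return output n-grams that appear verbatim somewhere in the corpus."""
    corpus_index = set()
    for doc in corpus_texts:
        corpus_index.update(ngrams(doc.split(), n))
    out_tokens = output_text.split()
    return [" ".join(g) for g in ngrams(out_tokens, n) if g in corpus_index]

# Toy example: the model "emits" a snippet it saw during training.
corpus = ["def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a"]
emitted = ("here is code def gcd(a, b):\n    while b:\n"
           "        a, b = b, a % b\n    return a")
print(len(verbatim_spans(emitted, corpus, n=6)) > 0)  # True: memorized span found
```

Real extraction studies work on model token IDs and much larger indexes (e.g. suffix arrays or Bloom filters), but the principle is the same: sufficiently long verbatim overlap with the training set is strong evidence of memorization.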
Related papers
- A Survey on Large Language Models for Code Generation [9.555952109820392]
Large Language Models (LLMs) have driven remarkable advances across diverse code-related tasks.
This survey aims to bridge the gap between academia and practical development by providing a comprehensive and up-to-date literature review.
arXiv Detail & Related papers (2024-06-01T17:48:15Z)
- Assessing LLMs in Malicious Code Deobfuscation of Real-world Malware Campaigns [7.776434991976473]
This paper studies the deobfuscation capabilities of large language models (LLMs)
We evaluate four LLMs with real-world malicious scripts used in the notorious Emotet malware campaign.
Our results indicate that, while not yet fully accurate, some LLMs can efficiently deobfuscate such payloads.
arXiv Detail & Related papers (2024-04-30T17:06:27Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown remarkable proficiency in natural language processing tasks.
LLMs often struggle on low-resource languages because little training data is available.
In this work, we explore training LLaMA-2 to speak Amharic, a language spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
- Assured LLM-Based Software Engineering [51.003878077888686]
This paper is an outline of the content of the keynote by Mark Harman at the International Workshop on Interpretability, Robustness, and Benchmarking in Neural Software Engineering, Monday 15th April 2024, Lisbon, Portugal.
arXiv Detail & Related papers (2024-02-06T20:38:46Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code)
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused? [49.99955642001019]
We show that open-sourced, aligned large language models can easily be misguided into generating undesired content.
Our key idea is to directly manipulate the generation process of open-sourced LLMs to steer it toward such content.
arXiv Detail & Related papers (2023-10-02T19:22:01Z)
- Calculating Originality of LLM Assisted Source Code [0.0]
We propose a neural network-based tool to determine the original effort put in by students (and the LLM's contribution) when writing source code.
Our tool is motivated by minimum description length measures like Kolmogorov complexity.
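The minimum-description-length intuition behind that entry can be sketched without a neural network. The example below is not the paper's tool; it is a classic compression-based proxy for Kolmogorov complexity (normalized compression distance, using `zlib`): text that compresses well against a reference is likely derivative of it.

```python
import zlib

def c(s: str) -> int:
    """Approximate description length: size of the zlib-compressed bytes."""
    return len(zlib.compress(s.encode("utf-8"), 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: smaller means more shared structure."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

student_code = "def add(a, b):\n    return a + b\n"
llm_code     = "def add(a, b):\n    return a + b\n"   # identical submission
other_code   = "class Stack:\n    def __init__(self):\n        self.items = []\n"

print(ncd(student_code, llm_code))    # small: little original effort vs. reference
print(ncd(student_code, other_code))  # larger: more independent content
```

For very short strings, compressor overhead inflates the distance, so real uses of this idea operate on whole files or submissions; the ordering (copied pairs score lower than unrelated pairs) is what matters.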
arXiv Detail & Related papers (2023-07-10T11:30:46Z)
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct [67.24653703564492]
We introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning.
Our model surpasses all other open-source Code LLMs by a substantial margin.
arXiv Detail & Related papers (2023-06-14T15:18:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences of its use.