Programming Language Agnostic Mining of Code and Language Pairs with
Sequence Labeling Based Question Answering
- URL: http://arxiv.org/abs/2203.10744v1
- Date: Mon, 21 Mar 2022 05:33:59 GMT
- Title: Programming Language Agnostic Mining of Code and Language Pairs with
Sequence Labeling Based Question Answering
- Authors: Changran Hu, Akshara Reddi Methukupalli, Yutong Zhou, Chen Wu, Yubo
Chen
- Abstract summary: Mining aligned natural language (NL) and programming language (PL) pairs is a critical task for NL-PL understanding.
We propose a Sequence Labeling based Question Answering (SLQA) method to mine NL-PL pairs in a PL-agnostic manner.
- Score: 15.733292367610627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mining aligned natural language (NL) and programming language (PL) pairs is a
critical task for NL-PL understanding. Existing methods applied specialized
hand-crafted features or separately-trained models for each PL. However, they
usually suffered from low transferability across multiple PLs, especially for
niche PLs with less annotated data. Fortunately, a Stack Overflow answer post
is essentially a sequence of text and code blocks and its global textual
context can provide PL-agnostic supplementary information. In this paper, we
propose a Sequence Labeling based Question Answering (SLQA) method to mine
NL-PL pairs in a PL-agnostic manner. In particular, we propose to apply the BIO
tagging scheme instead of the conventional binary scheme to mine the code
solutions which are often composed of multiple blocks of a post. Experiments on
current single-PL single-block benchmarks and a manually-labeled cross-PL
multi-block benchmark prove the effectiveness and transferability of SLQA. We
further present a parallel NL-PL corpus named Lang2Code automatically mined
with SLQA, which contains about 1.4M pairs across 6 PLs. Through statistical analysis
and downstream evaluation, we demonstrate that Lang2Code is a large-scale
high-quality data resource for further NL-PL research.
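To make the BIO formulation concrete, the following is a minimal sketch (not the authors' implementation) of how block-level BIO tags predicted over the code blocks of an answer post can be decoded into code solutions that span multiple blocks; the example blocks, tags, and the decode_bio helper are illustrative assumptions. A binary scheme labels each block independently and therefore cannot tell whether two adjacent positive blocks form one solution or two, which is exactly what the B ("begin") versus I ("inside") distinction recovers.

# Illustrative sketch only: decode block-level BIO tags into code solutions.
from typing import List

def decode_bio(code_blocks: List[str], tags: List[str]) -> List[List[str]]:
    """Group consecutive B/I-tagged blocks into solutions; O blocks are skipped."""
    assert len(code_blocks) == len(tags)
    solutions: List[List[str]] = []
    current: List[str] = []
    for block, tag in zip(code_blocks, tags):
        if tag == "B":                    # a new solution starts at this block
            if current:
                solutions.append(current)
            current = [block]
        elif tag == "I" and current:      # this block continues the current solution
            current.append(block)
        else:                             # "O" (or a stray "I"): not part of a solution
            if current:
                solutions.append(current)
            current = []
    if current:
        solutions.append(current)
    return solutions

# Hypothetical answer post with three code blocks; tags would come from the model.
blocks = ["pip install requests", "import requests\nrequests.get(url)", "print('debug')"]
tags = ["B", "I", "O"]
print(decode_bio(blocks, tags))  # [['pip install requests', 'import requests\nrequests.get(url)']]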
Related papers
- CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation [24.468767564264738]
We present CrossPL, the first benchmark designed to evaluate large language models' (LLMs) ability to generate cross-programming-language (CPL) code. CrossPL comprises 1,982 tasks centered around inter-process communication (IPC), covering six widely-used programming languages and seven representative CPL techniques. We evaluate 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL via FSM-based validation.
arXiv Detail & Related papers (2025-07-26T10:28:39Z)
- How Programming Concepts and Neurons Are Shared in Code Language Models [55.22005737371843]
We perform a few-shot translation task on 21 PL pairs using two Llama-based models. We observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers.
arXiv Detail & Related papers (2025-06-01T16:24:13Z)
- Bridge-Coder: Unlocking LLMs' Potential to Overcome Language Gaps in Low-Resource Code [31.48411893252137]
Large Language Models (LLMs) demonstrate strong proficiency in generating code for high-resource programming languages (HRPLs) like Python but struggle significantly with low-resource programming languages (LRPLs) such as Racket or D.
This performance gap deepens the digital divide, preventing developers using LRPLs from benefiting equally from LLM advancements and reinforcing disparities in innovation within underrepresented programming communities.
We introduce a novel approach called Bridge-Coder, which leverages LLMs' intrinsic capabilities to enhance the performance on LRPLs.
arXiv Detail & Related papers (2024-10-24T17:55:03Z)
- Position IDs Matter: An Enhanced Position Layout for Efficient Context Compression in Large Language Models [50.637714223178456]
We propose Enhanced Position Layout (EPL) to improve the context compression capability of large language models (LLMs). EPL minimizes the distance between context tokens and their corresponding special tokens and at the same time maintains the sequence order in position IDs. When extended to multimodal scenarios, EPL brings an average accuracy gain of 2.6 to vision compression LLMs.
arXiv Detail & Related papers (2024-09-22T08:51:18Z)
- SSP: Self-Supervised Prompting for Cross-Lingual Transfer to Low-Resource Languages using Large Language Models [23.522223369054437]
Self-Supervised Prompting (SSP) is a novel in-context learning (ICL) approach tailored for the zero-labelled cross-lingual transfer (0-CLT) setting.
SSP is based on the key observation that LLMs output more accurate labels if in-context exemplars are from the target language.
SSP strongly outperforms existing SOTA fine-tuned and prompting-based baselines in the 0-CLT setup.
arXiv Detail & Related papers (2024-06-27T04:21:59Z)
- Nearest Neighbor Speculative Decoding for LLM Generation and Attribution [87.3259169631789]
Nearest Neighbor Speculative Decoding (NEST) is capable of incorporating real-world text spans of arbitrary length into LM generations and providing attribution to their sources.
NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks.
In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B.
arXiv Detail & Related papers (2024-05-29T17:55:03Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the original code and its LLM-rewritten variants; a minimal sketch of this similarity idea appears after this list.
Our results demonstrate a significant improvement over existing SOTA synthetic content detectors.
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data to improve LLMs' instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
- ANPL: Towards Natural Programming with Interactive Decomposition [33.58825633046242]
We introduce an interactive ANPL system that ensures users can always refine the generated code.
An ANPL program consists of a set of input-outputs that it must satisfy, a sketch (the control and data flow written in precise code), and holes (sub-modules described in natural language for an LLM to implement).
The user revises an ANPL program by either modifying the sketch, changing the language used to describe the holes, or providing additional input-outputs to a particular hole.
arXiv Detail & Related papers (2023-05-29T14:19:40Z)
- ProgSG: Cross-Modality Representation Learning for Programs in Electronic Design Automation [38.023395256208055]
High-level synthesis (HLS) allows a developer to compile a high-level description, written as software code in C and C++, into a hardware design.
HLS tools still require microarchitecture decisions, expressed in terms of pragmas.
We propose ProgSG, which allows the source code sequence modality and the graph modality to interact with each other in a deep and fine-grained way.
arXiv Detail & Related papers (2023-05-18T09:44:18Z)
- MultiCoder: Multi-Programming-Lingual Pre-Training for Low-Resource Code Completion [21.100570496144694]
We propose MultiCoder to enhance low-resource code completion via multi-programming-lingual (MultiPL) pre-training and MultiPL Mixture-of-Experts (MoE) layers.
We also propose a novel PL-level MoE routing strategy (PL-MoE) for improving code completion on all PLs.
arXiv Detail & Related papers (2022-12-19T17:50:05Z)
- Improving Mandarin End-to-End Speech Recognition with Word N-gram Language Model [57.92200214957124]
External language models (LMs) are used to improve the recognition performance of end-to-end (E2E) automatic speech recognition (ASR) systems.
We propose a novel decoding algorithm where a word-level lattice is constructed on-the-fly to consider all possible word sequences.
Our method consistently outperforms subword-level LMs, including N-gram LM and neural network LM.
arXiv Detail & Related papers (2022-01-06T10:04:56Z)
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)
- Synthetic Datasets for Neural Program Synthesis [66.20924952964117]
We propose a new methodology for controlling and evaluating the bias of synthetic data distributions over both programs and specifications.
We demonstrate, using the Karel DSL and a small Calculator DSL, that training deep networks on these distributions leads to improved cross-distribution generalization performance.
arXiv Detail & Related papers (2019-12-27T21:28:10Z)
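Returning to the zero-shot synthetic code detector entry above ("Uncovering LLM-Generated Code"), here is the minimal sketch referenced there of the similarity-between-rewrites idea, under the assumption that code an LLM generated tends to change little when the same LLM is asked to rewrite it; rewrite_with_llm is a hypothetical stub, and the rewrite count and threshold are illustrative, not the paper's settings.

# Minimal sketch of the similarity idea only, not the paper's method.
import difflib
from statistics import mean
from typing import Callable

def looks_llm_generated(code: str,
                        rewrite_with_llm: Callable[[str], str],
                        n_rewrites: int = 4,
                        threshold: float = 0.8) -> bool:
    """Flag code whose mean similarity to its LLM rewrites exceeds the threshold."""
    similarities = []
    for _ in range(n_rewrites):
        variant = rewrite_with_llm(code)  # e.g., prompt an LLM: "rewrite this code"
        # character-level similarity ratio in [0, 1]
        similarities.append(difflib.SequenceMatcher(None, code, variant).ratio())
    return mean(similarities) >= threshold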
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.