Related papers: Measuring The Impact Of Programming Language Distribution

Measuring The Impact Of Programming Language Distribution

URL: http://arxiv.org/abs/2302.01973v3
Date: Wed, 24 May 2023 16:20:33 GMT
Title: Measuring The Impact Of Programming Language Distribution
Authors: Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta
Abstract summary: We present the BabelCode framework for execution-based evaluation of any benchmark in any language. We present a new code translation dataset called Translating Python Programming Puzzles (TP3) We investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages.
Score: 28.96076723773365
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3) from the Python Programming Puzzles (Schuster et al. 2021) benchmark that involves translating expert-level python functions to any language. With both BabelCode and the TP3 benchmark, we investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34% higher $pass@k$ across all tasks and languages compared to the baseline. We find that this strategy achieves 66.48% better $pass@k$ on low-resource languages at the cost of only a 12.94% decrease to high-resource languages. In our three translation tasks, this strategy yields, on average, 30.77% better low-resource $pass@k$ while having 19.58% worse high-resource $pass@k$.

Related papers

Evaluation of the Code Generation Capabilities of ChatGPT 4: A Comparative Analysis in 19 Programming Languages [0.0]
This thesis examines the capabilities of ChatGPT 4 in code generation across 19 programming languages. ChatGPT 4 successfully solved 39.67% of all tasks, with success rates decreasing significantly as problem complexity increased. The model exhibited above-average runtime efficiency in all programming languages.
arXiv Detail & Related papers (2025-01-04T17:17:01Z)
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages. It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total. Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts [51.49688654641581]
We propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages. Experimental results reveal that it significantly outperforms Python Self-Consistency. In particular, MultiPoT achieves more than 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701)
arXiv Detail & Related papers (2024-02-16T13:48:06Z)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
Detecting Languages Unintelligible to Multilingual Models through Local Structure Probes [15.870989191524094]
We develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model. Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language.
arXiv Detail & Related papers (2022-11-09T16:45:16Z)
No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages. We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks. Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks. When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result. We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z)
Leveraging Automated Unit Tests for Unsupervised Code Translation [34.84910520660154]
We propose to leverage an automated unit-testing system to filter out invalid translations. We find that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated. In particular, for Java $to$ Python and Python $to$ C++ we outperform the best previous methods by more than 16% and 24% respectively.
arXiv Detail & Related papers (2021-10-13T15:08:43Z)
XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
Deep Learning Models for Multilingual Hate Speech Detection [5.977278650516324]
In this paper, we conduct a large scale analysis of multilingual hate speech in 9 languages from 16 different sources. We observe that in low resource setting, simple models such as LASER embedding with logistic regression performs the best. In case of zero-shot classification, languages such as Italian and Portuguese achieve good results.
arXiv Detail & Related papers (2020-04-14T13:14:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.