Measuring The Impact Of Programming Language Distribution
- URL: http://arxiv.org/abs/2302.01973v3
- Date: Wed, 24 May 2023 16:20:33 GMT
- Title: Measuring The Impact Of Programming Language Distribution
- Authors: Gabriel Orlanski, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua
Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta
- Abstract summary: We present the BabelCode framework for execution-based evaluation of any benchmark in any language.
We present a new code translation dataset called Translating Python Programming Puzzles (TP3)
We investigate if balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages.
- Score: 28.96076723773365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current benchmarks for evaluating neural code models focus on only a small
subset of programming languages, excluding many popular languages such as Go or
Rust. To ameliorate this issue, we present the BabelCode framework for
execution-based evaluation of any benchmark in any language. BabelCode enables
new investigations into the qualitative performance of models' memory, runtime,
and individual test case results. Additionally, we present a new code
translation dataset called Translating Python Programming Puzzles (TP3) from
the Python Programming Puzzles (Schuster et al. 2021) benchmark that involves
translating expert-level python functions to any language. With both BabelCode
and the TP3 benchmark, we investigate if balancing the distributions of 14
languages in a training dataset improves a large language model's performance
on low-resource languages. Training a model on a balanced corpus results in, on
average, 12.34% higher $pass@k$ across all tasks and languages compared to the
baseline. We find that this strategy achieves 66.48% better $pass@k$ on
low-resource languages at the cost of only a 12.94% decrease to high-resource
languages. In our three translation tasks, this strategy yields, on average,
30.77% better low-resource $pass@k$ while having 19.58% worse high-resource
$pass@k$.
Related papers
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts [51.49688654641581]
We propose a task and model agnostic approach called MultiPoT, which harnesses strength and diversity from various languages.
Experimental results reveal that it significantly outperforms Python Self-Consistency.
In particular, MultiPoT achieves more than 4.6% improvement on average on ChatGPT (gpt-3.5-turbo-0701)
arXiv Detail & Related papers (2024-02-16T13:48:06Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - Detecting Languages Unintelligible to Multilingual Models through Local
Structure Probes [15.870989191524094]
We develop a general approach that requires only unlabelled text to detect which languages are not well understood by a cross-lingual model.
Our approach is derived from the hypothesis that if a model's understanding is insensitive to perturbations to text in a language, it is likely to have a limited understanding of that language.
arXiv Detail & Related papers (2022-11-09T16:45:16Z) - No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z) - OneAligner: Zero-shot Cross-lingual Transfer with One Rich-Resource
Language Pair for Low-Resource Sentence Retrieval [91.76575626229824]
We present OneAligner, an alignment model specially designed for sentence retrieval tasks.
When trained with all language pairs of a large-scale parallel multilingual corpus (OPUS-100), this model achieves the state-of-the-art result.
We conclude through empirical results and analyses that the performance of the sentence alignment task depends mostly on the monolingual and parallel data size.
arXiv Detail & Related papers (2022-05-17T19:52:42Z) - Leveraging Automated Unit Tests for Unsupervised Code Translation [34.84910520660154]
We propose to leverage an automated unit-testing system to filter out invalid translations.
We find that fine-tuning an unsupervised model with this filtered data set significantly reduces the noise in the translations so-generated.
In particular, for Java $to$ Python and Python $to$ C++ we outperform the best previous methods by more than 16% and 24% respectively.
arXiv Detail & Related papers (2021-10-13T15:08:43Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - Deep Learning Models for Multilingual Hate Speech Detection [5.977278650516324]
In this paper, we conduct a large scale analysis of multilingual hate speech in 9 languages from 16 different sources.
We observe that in low resource setting, simple models such as LASER embedding with logistic regression performs the best.
In case of zero-shot classification, languages such as Italian and Portuguese achieve good results.
arXiv Detail & Related papers (2020-04-14T13:14:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.