Fugu-MT Paper Translation (Abstract): SantaCoder: don't reach for the stars!

Paper overview: SantaCoder: don't reach for the stars!

arxiv url: http://arxiv.org/abs/2301.03988v1
Date: Mon, 9 Jan 2023 10:52:35 GMT
Status: Translation complete
System update date: 2023-01-11 16:35:54.901028
Title: SantaCoder: don't reach for the stars!
Title (reference translation): SantaCoder: don't reach for the stars!
Authors: Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra
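Abstract summary: The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. Our best model outperforms previous open-source multilingual code generation models in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E.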
Reference score (independently computed attention score): 27.050410834027705
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.
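Abstract (reference translation): The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This technical report outlines the collaboration's progress through December 2022, covering the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments to de-risk the model architecture, and the investigation of better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. Selecting files from repositories with 5+ GitHub stars degrades performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B). All models are released under an OpenRAIL license at https://hf.co/bigcode.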

Related papers