Fugu-MT paper translation (abstract): Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Paper overview: Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

arxiv url: http://arxiv.org/abs/2602.14675v1
Date: Mon, 16 Feb 2026 12:02:29 GMT
Status: Translation complete
Last updated in system: 2026-02-17 16:22:50.388404
Title: Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
Authors: Gianluca Vico, Jindřich Libovický
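Abstract summary: The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+. This resource is used to benchmark several large language models on tokenization parity, topic classification, and machine translation.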
Reference score (independently computed attention score): 1.3873397698625443
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
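The tokenization penalty mentioned in the abstract can be illustrated with a small sketch. The abstract does not give the paper's exact formula, so this assumes one common definition of tokenization parity: the mean, over parallel sentence pairs, of the token count in one language divided by the token count in the other. The names `tokenization_parity` and `toy_tokenize` are hypothetical, the toy tokenizer is only a stand-in for a real LLM tokenizer, and the example strings are placeholders, not dataset sentences.

```python
from typing import Callable, List, Sequence


def toy_tokenize(text: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer.

    Splits on whitespace, then cuts each word into pieces of at most
    `piece_len` characters. A real benchmark would use the model's tokenizer.
    """
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


def tokenization_parity(
    tokenize: Callable[[str], Sequence[str]],
    sentences: Sequence[str],
    reference_sentences: Sequence[str],
) -> float:
    """Mean per-sentence ratio of token counts (assumed definition).

    A value above 1.0 means `sentences` need more tokens than their
    parallel `reference_sentences`, i.e. they incur a tokenization penalty.
    """
    if len(sentences) != len(reference_sentences):
        raise ValueError("parallel corpora must have the same number of sentences")
    if not sentences:
        raise ValueError("at least one sentence pair is required")

    ratios = []
    for sent, ref in zip(sentences, reference_sentences):
        ref_len = len(tokenize(ref))
        if ref_len == 0:
            raise ValueError("reference sentence produced no tokens")
        ratios.append(len(tokenize(sent)) / ref_len)
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Placeholder strings only; these are not sentences from the dataset.
    low_resource = ["abcdefg hi"]
    high_resource = ["abc hi"]
    print(tokenization_parity(toy_tokenize, low_resource, high_resource))
```

Passing the tokenizer as a callable keeps the sketch independent of any particular model, so the same function could be reused for each tokenizer under comparison.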
