Fugu-MT Paper Translation (Abstract): Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

Paper overview: Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

arxiv url: http://arxiv.org/abs/2509.07829v1
Date: Tue, 09 Sep 2025 15:07:14 GMT
Status: Translation complete
System update date: 2025-09-10 14:38:27.371135
Title: Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost
Title (reference translation): Small Open Models Approaching Parity with Large Models in Low-Resource Literary Translation
Authors: Mihai Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran
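Abstract summary: The TINYFABULIST TRANSLATION FRAMEWORK (TF2) is a unified framework for dataset creation, fine-tuning, and evaluation in English-Romanian literary translation. It builds on DS-TF1-EN-3M (TF1) and addresses the need for rich, high-quality literary datasets in low-resource languages such as Romanian.
Reference score (independently computed attention score): 0.5599792629509229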
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, the translation by small open models remains an open problem. We contribute to this ongoing research by introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine tuning, and evaluation in English-Romanian literary translations, centred on the creation and open release of both a compact, fine tuned language model (TF2-12B) and large scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high quality literary datasets in low resource languages such as Romanian. Our pipeline first generates 15k high quality Romanian references from the TF1 pool using a high performing LLM. We then apply a two stage fine tuning process to a 12B parameter open weight model: (i) instruction tuning to capture genre specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus level BLEU and a five dimension LLM based rubric (accuracy, fluency, coherence, style, cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine tuned model achieves fluency and adequacy competitive with top performing large proprietary models, while being open, accessible, and significantly more cost effective. Alongside the fine tuned model and both datasets, we publicly release all scripts and evaluation prompts. TF2 thus provides an end-to-end, reproducible pipeline for research on cost efficient translation, cross lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low resource settings.
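Abstract (reference translation): Literary translation has recently drawn attention in machine translation research as a distinct and complex task. However, translation by small open models remains an unsolved problem. This paper introduces the TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine-tuning, and evaluation in English-Romanian literary translation, focused on creating and openly releasing a compact fine-tuned language model (TF2-12B) and large-scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). It builds on DS-TF1-EN-3M (TF1) and addresses the need for rich, high-quality literary datasets in low-resource languages such as Romanian. The pipeline first generates 15k high-quality Romanian references from the TF1 pool using a high-performing LLM. It then applies a two-stage fine-tuning process to a 12B-parameter open-weight model: (i) instruction tuning to capture genre-specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus-level BLEU with a five-dimension LLM-based rubric (accuracy, fluency, coherence, style, cultural adaptation) to give a nuanced assessment of translation quality. The results show that the fine-tuned model achieves fluency and adequacy competitive with large proprietary models while being open, accessible, and significantly more cost-effective. In addition to the fine-tuned model and both datasets, all scripts and evaluation prompts are publicly released. TF2 provides an end-to-end, reproducible pipeline for research on cost-efficient translation, cross-lingual narrative generation, and the wider adoption of open models for culturally significant literary content in low-resource settings.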

Related papers