Towards Effective Ancient Chinese Translation: Dataset, Model, and
Evaluation
- URL: http://arxiv.org/abs/2308.00240v1
- Date: Tue, 1 Aug 2023 02:43:27 GMT
- Title: Towards Effective Ancient Chinese Translation: Dataset, Model, and
Evaluation
- Authors: Geyang Guo, Jiarong Yang, Fengyuan Lu, Jiaxin Qin, Tianyi Tang, Wayne
Xin Zhao
- Abstract summary: In this paper, we propose Erya for ancient Chinese translation.
From a dataset perspective, we collect, clean, and classify ancient Chinese materials from various sources.
From a model perspective, we devise the Erya training method, oriented towards ancient Chinese.
- Score: 28.930640246972516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interpreting ancient Chinese has been the key to comprehending vast Chinese
literature, tradition, and civilization. In this paper, we propose Erya for
ancient Chinese translation. From a dataset perspective, we collect, clean, and
classify ancient Chinese materials from various sources, forming the most
extensive ancient Chinese resource to date. From a model perspective, we devise
the Erya training method, oriented towards ancient Chinese. We design two
jointly-working tasks: disyllabic aligned substitution (DAS) and dual masked
language model (DMLM). From an evaluation perspective, we build a benchmark to
judge ancient Chinese translation quality in different scenarios and evaluate
the ancient Chinese translation capacities of various existing models. Our
model exhibits remarkable zero-shot performance across five domains, with over
+12.0 BLEU against GPT-3.5 models and better human evaluation results than
ERNIE Bot. Subsequent fine-tuning further shows the superior transfer
capability of the Erya model, with a +6.2 BLEU gain. We release all the
above-mentioned resources at https://github.com/RUCAIBox/Erya.
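The abstract names the two training tasks but not their mechanics. Below is a rough Python sketch of one plausible reading of each: DAS swaps ancient monosyllabic words for modern disyllabic equivalents via an alignment lexicon, and DMLM masks characters on both sides of an ancient/modern pair. The lexicon entries, masking rates, and function names are all hypothetical illustrations, not the paper's actual implementation.

```python
import random

MASK = "[MASK]"

# Toy monosyllabic -> disyllabic lexicon (illustrative entries only; Erya's
# actual alignment table would be mined from its parallel corpora).
DAS_LEXICON = {"学": "学习", "师": "老师", "友": "朋友"}

def disyllabic_aligned_substitution(ancient: str, p: float = 0.3) -> str:
    """Randomly swap ancient monosyllabic words for modern disyllabic
    equivalents, nudging the model toward modern Chinese word forms."""
    out = []
    for ch in ancient:
        out.append(DAS_LEXICON[ch] if ch in DAS_LEXICON and random.random() < p else ch)
    return "".join(out)

def dual_masked_pair(src: str, tgt: str, p: float = 0.15):
    """Mask characters on BOTH the ancient source and the modern target,
    so each side must be reconstructed from the other (one plausible
    reading of the "dual" in DMLM)."""
    mask = lambda s: "".join(MASK if random.random() < p else c for c in s)
    return mask(src), mask(tgt)

print(disyllabic_aligned_substitution("学而时习之"))
print(dual_masked_pair("学而时习之", "学了又按时温习"))
```

As for the BLEU figures quoted above, Chinese output is conventionally scored with sacrebleu's "zh" tokenizer, e.g. sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh"); whether Erya's benchmark uses exactly this setup is not stated here.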
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions are among the oldest surviving forms of writing in the world.
A large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in paleography today.
This paper introduces a novel approach, Puzzle Pieces Picker (P³), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
- Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding [57.22231959529641]
Hunyuan-DiT is a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese.
For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images.
arXiv Detail & Related papers (2024-05-14T16:33:25Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Can Large Language Model Comprehend Ancient Chinese? A Preliminary Test on ACLUE [23.598825660594926]
ACLUE is an evaluation benchmark designed to assess the capability of language models in comprehending ancient Chinese.
We observe a noticeable disparity in model performance between modern Chinese and ancient Chinese.
ChatGLM2 performs best, achieving an average score of 37.4%; a toy scoring sketch follows this entry.
arXiv Detail & Related papers (2023-10-14T10:06:39Z)
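As context for how a multiple-choice benchmark such as ACLUE is typically scored, here is a minimal accuracy harness; the item schema, field names, and example item are assumptions, not ACLUE's actual format.

```python
from typing import Callable, Dict, List

# Toy items in an assumed schema (question, lettered options, gold letter);
# the real ACLUE data format may differ.
EXAMPLES: List[Dict] = [
    {"question": "「学而时习之」中「习」意为?",
     "options": {"A": "温习", "B": "习惯", "C": "风俗", "D": "学问"},
     "answer": "A"},
]

def accuracy(examples: List[Dict], predict: Callable[[str, Dict], str]) -> float:
    """Average exact-match accuracy over gold option letters."""
    hits = sum(predict(ex["question"], ex["options"]) == ex["answer"]
               for ex in examples)
    return hits / len(examples)

# Trivial baseline: always answer "A" (a real run would query a model).
print(accuracy(EXAMPLES, lambda q, opts: "A"))
```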
- Efficient and Effective Text Encoding for Chinese LLaMA and Alpaca [23.00353889531171]
We propose a method to augment LLaMA with capabilities for understanding and generating Chinese text.
We incorporate secondary pre-training using Chinese data and fine-tune the model with Chinese instruction datasets.
Results on the C-Eval dataset show performance competitive with models several times larger; a vocabulary-extension sketch follows this entry.
arXiv Detail & Related papers (2023-04-17T11:39:53Z)
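The entry above pairs vocabulary extension with secondary pre-training. A minimal sketch of the vocabulary-extension step with Hugging Face transformers follows; the checkpoint path and token list are placeholders, and the actual work merges a full Chinese SentencePiece vocabulary rather than adding a handful of tokens.

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE = "path/to/llama-base"  # placeholder; use a real LLaMA-compatible path

tokenizer = LlamaTokenizer.from_pretrained(BASE)
model = LlamaForCausalLM.from_pretrained(BASE)

# Add Chinese tokens so common words stop being split into many byte-level
# pieces. Illustrative list only; the paper merges a trained Chinese vocab.
num_added = tokenizer.add_tokens(["古文", "翻译", "文言"])
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
# Secondary pre-training on Chinese text and instruction tuning would follow.
```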
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors; a template sketch follows this entry.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
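Chain of Hindsight serializes paired feedback into ordinary training text. One commonly cited template shape is sketched below; the exact wording, and the fact that the paper restricts the training loss to answer tokens, are details this sketch does not reproduce.

```python
def chain_of_hindsight_example(prompt: str, good: str, bad: str) -> str:
    """Serialize a worse and a better answer into one sequence so the
    model learns the contrast (template wording is illustrative)."""
    return (f"{prompt}\n"
            f"A bad answer: {bad}\n"
            f"A good answer: {good}")

print(chain_of_hindsight_example(
    "Translate 「学而时习之」 into modern Chinese.",
    "学了知识然后按时温习它。",
    "学习和时间。",
))
```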
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback [70.5469946314539]
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of translation reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation [22.08457469951396]
AnchiBERT is a pre-trained language model based on the architecture of BERT.
We evaluate AnchiBERT on both language understanding and generation tasks, including poem classification; a fine-tuning sketch follows this entry.
arXiv Detail & Related papers (2020-09-24T03:41:13Z)
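For classification-style evaluations like the poem classification mentioned above, a standard BERT fine-tuning setup with transformers would look like this sketch; the checkpoint path and label count are placeholders, not AnchiBERT's released configuration.

```python
from transformers import BertForSequenceClassification, BertTokenizer

CKPT = "path/to/anchibert"  # placeholder; the released checkpoint name differs

tokenizer = BertTokenizer.from_pretrained(CKPT)
model = BertForSequenceClassification.from_pretrained(CKPT, num_labels=4)

# Score one poem line; each logit corresponds to an assumed poem class.
inputs = tokenizer("床前明月光，疑是地上霜。", return_tensors="pt")
logits = model(**inputs).logits
print(logits.argmax(dim=-1))
```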
This list is automatically generated from the titles and abstracts of the papers on this site.