Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages
- URL: http://arxiv.org/abs/2307.08714v1
- Date: Sun, 16 Jul 2023 00:45:42 GMT
- Title: Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages
- Authors: Sunisth Kumar, Davide Liu, Alexandre Boulenger
- Abstract summary: We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
- Score: 70.25418443146435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an efficient modeling framework for cross-lingual named entity
recognition in semi-structured text data. Our approach relies on both knowledge
distillation and consistency training. The modeling framework leverages
knowledge from a large language model (XLM-RoBERTa) pre-trained on the source
language, with a student-teacher relationship (knowledge distillation). The
student model incorporates unsupervised consistency training (with KL
divergence loss) on the low-resource target language.
We employ two independent datasets of SMSs in English and Arabic, each
carrying semi-structured banking transaction information, and focus on
exhibiting the transfer of knowledge from English to Arabic. With access to
only 30 labeled samples, our model can generalize the recognition of merchants,
amounts, and other fields from English to Arabic. We show that our modeling
approach, while efficient, performs best overall when compared to
state-of-the-art approaches like DistilBERT pre-trained on the target language
or a supervised model directly trained on labeled data in the target language.
Our experiments show that it is enough to learn to recognize entities in
English to reach reasonable performance in a low-resource language in the
presence of a few labeled samples of semi-structured data. The proposed
framework has implications for developing multi-lingual applications,
especially in geographies where digital endeavors rely on both English and one
or more low-resource language(s), sometimes mixed with English or employed
singly.
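Below is a minimal sketch of the teacher-student objective described in the abstract, combining knowledge distillation on English data with KL-divergence consistency training on unlabeled Arabic data. It is not the authors' released code: the "xlm-roberta-base" checkpoint names, the label count, the temperature, the loss weight `alpha`, and the assumption that an augmented copy of each Arabic batch is available are all illustrative choices.

```python
# Minimal sketch of the described framework (illustrative, not the authors' code).
import torch
import torch.nn.functional as F
from transformers import AutoModelForTokenClassification, AutoTokenizer

NUM_LABELS = 9  # e.g. BIO tags for merchant, amount, and other fields (assumed)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # used to build the SMS batches (not shown)
teacher = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=NUM_LABELS)  # assumed already fine-tuned on English SMS NER
student = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=NUM_LABELS)
teacher.eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
alpha = 1.0  # weight of the consistency term (assumption)


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Knowledge distillation: student matches the teacher's softened token distributions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)


def consistency_loss(batch, augmented_batch):
    """Unsupervised consistency on unlabeled Arabic SMSs: predictions on an
    original message and on a perturbed copy of it should agree (KL divergence)."""
    with torch.no_grad():
        p_orig = F.softmax(student(**batch).logits, dim=-1)
    log_p_aug = F.log_softmax(student(**augmented_batch).logits, dim=-1)
    return F.kl_div(log_p_aug, p_orig, reduction="batchmean")


def training_step(english_batch, arabic_batch, arabic_batch_aug):
    """One combined update: distill on English, stay consistent on Arabic."""
    with torch.no_grad():
        teacher_logits = teacher(**english_batch).logits
    student_logits = student(**english_batch).logits
    loss = kd_loss(student_logits, teacher_logits)
    loss = loss + alpha * consistency_loss(arabic_batch, arabic_batch_aug)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the same XLM-RoBERTa backbone serves as both teacher and student for simplicity; only the structure of the combined loss is intended to mirror the framework described above.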
Related papers
- Building Dialogue Understanding Models for Low-resource Language Indonesian from Scratch [31.50694642284321]
We propose a Bi-Confidence-Frequency Cross-Lingual transfer framework (BiCF) to address the scarcity of dialogue data in low-resource languages.
Our framework performs reliably and cost-efficiently on different scales of manually annotated Indonesian data.
arXiv Detail & Related papers (2024-10-24T04:33:14Z)
- Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages.
This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
- CroissantLLM: A Truly Bilingual French-English Language Model [42.03897426049679]
We introduce CroissantLLM, a 1.3B language model pretrained on a set of 3T English and French tokens.
We pioneer the approach of training an intrinsically bilingual model with a 1:1 English-to-French pretraining data ratio.
To assess performance outside of English, we craft a novel benchmark, FrenchBench.
arXiv Detail & Related papers (2024-02-01T17:17:55Z)
- MergeDistill: Merging Pre-trained Language Models using Distillation [5.396915402673246]
We propose MergeDistill, a framework to merge pre-trained LMs in a way that can best leverage their assets with minimal dependencies.
We demonstrate the applicability of our framework in a practical setting by leveraging pre-existing teacher LMs and training student LMs that perform competitively with or even outperform teacher LMs trained on several orders of magnitude more data and with a fixed model capacity.
arXiv Detail & Related papers (2021-06-05T08:22:05Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
- Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z)