Domain-Specific Language Model Post-Training for Indonesian Financial
NLP
- URL: http://arxiv.org/abs/2310.09736v1
- Date: Sun, 15 Oct 2023 05:07:08 GMT
- Title: Domain-Specific Language Model Post-Training for Indonesian Financial
NLP
- Authors: Ni Putu Intan Maharani, Yoga Yustiawan, Fauzy Caesar Rochim, Ayu
Purwarianti
- Abstract summary: BERT and IndoBERT have achieved impressive performance in several NLP tasks.
We focus on the financial domain and the Indonesian language, performing post-training of pre-trained IndoBERT for the financial domain.
- Score: 1.8377013498056056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: BERT and IndoBERT have achieved impressive performance in several NLP tasks.
There have been several investigations into their adaptation to specialized domains,
especially for the English language. We focus on the financial domain and the Indonesian
language, where we perform post-training of pre-trained IndoBERT for the financial
domain using a small-scale Indonesian financial corpus. In this paper, we
construct an Indonesian self-supervised financial corpus, Indonesian financial
sentiment analysis dataset, Indonesian financial topic classification dataset,
and release a family of BERT models for financial NLP. We also evaluate the
effectiveness of domain-specific post-training on sentiment analysis and topic
classification tasks. Our findings indicate that the post-training increases
the effectiveness of a language model when it is fine-tuned to domain-specific
downstream tasks.
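A minimal sketch of the post-training step described above: continued masked-language-model training of IndoBERT on an Indonesian financial corpus using Hugging Face transformers. The checkpoint name is the public indobenchmark/indobert-base-p1 IndoBERT model; the corpus file name and the hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# Sketch: domain-specific post-training (continued MLM training) of IndoBERT
# on a small Indonesian financial corpus, as outlined in the abstract.
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "indobenchmark/indobert-base-p1"  # public IndoBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plain-text file holding the self-supervised financial corpus.
corpus = load_dataset("text", data_files={"train": "indonesian_financial_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style dynamic masking (15% of tokens) for self-supervised post-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="indobert-financial-posttrained",
    num_train_epochs=3,              # assumed value; not specified in the abstract
    per_device_train_batch_size=16,  # assumed value
    learning_rate=5e-5,              # assumed value
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
model.save_pretrained("indobert-financial-posttrained")
tokenizer.save_pretrained("indobert-financial-posttrained")
```

The resulting post-trained checkpoint would then be fine-tuned with a standard sequence-classification head on the released sentiment analysis and topic classification datasets to measure the effect reported in the paper.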
Related papers
- Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German⇔English and 22 Chinese⇔English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z) - Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z) - No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks [75.29561463156635]
ICE-PIXIU uniquely integrates a spectrum of Chinese tasks, alongside translated and original English datasets.
It provides unrestricted access to diverse model variants, a compilation of diverse cross-lingual and multi-modal instruction data, and an evaluation benchmark with expert annotations.
arXiv Detail & Related papers (2024-03-10T16:22:20Z) - Is ChatGPT a Financial Expert? Evaluating Language Models on Financial
Natural Language Processing [22.754757518792395]
FinLMEval is a framework for Financial Language Model Evaluation.
This study compares the performance of encoder-only and decoder-only language models.
arXiv Detail & Related papers (2023-10-19T11:43:15Z) - Domain Adaptation for Arabic Machine Translation: The Case of Financial
Texts [0.7673339435080445]
We develop a parallel corpus for Arabic-English (AR-EN) translation in the financial domain.
We fine-tune several NMT and Large Language models including ChatGPT-3.5 Turbo.
The quality of ChatGPT's translations was superior to that of the other models according to both automatic and human evaluations.
arXiv Detail & Related papers (2023-09-22T13:37:19Z) - Removing Non-Stationary Knowledge From Pre-Trained Language Models for
Entity-Level Sentiment Classification in Finance [0.0]
We build KorFinASC, a Korean aspect-level sentiment classification dataset for finance consisting of 12,613 human-annotated samples.
We use the term "non-stationary knowledge" to refer to information that was previously correct but is likely to change, and present "TGT-Masking", a novel masking pattern.
arXiv Detail & Related papers (2023-01-09T01:26:55Z) - One Country, 700+ Languages: NLP Challenges for Underrepresented
Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z) - FinEAS: Financial Embedding Analysis of Sentiment [0.0]
We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS).
In this work, we propose a new model for financial sentiment analysis based on supervised fine-tuned sentence embeddings from a standard BERT model.
arXiv Detail & Related papers (2021-10-31T15:41:56Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z) - FinBERT: A Pretrained Language Model for Financial Communications [25.900063840368347]
There are no pretrained finance-specific language models available.
We address this need by pretraining a financial domain-specific BERT model, FinBERT, using large-scale financial communication corpora.
Experiments on three financial sentiment classification tasks confirm the advantage of FinBERT over the generic-domain BERT model.
arXiv Detail & Related papers (2020-06-15T02:51:06Z) - DomBERT: Domain-oriented Language Model for Aspect-based Sentiment
Analysis [71.40586258509394]
We propose DomBERT, an extension of BERT that learns from both the in-domain corpus and relevant domain corpora.
Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.
arXiv Detail & Related papers (2020-04-28T21:07:32Z)