Domain-Specific Language Model Post-Training for Indonesian Financial NLP
- URL: http://arxiv.org/abs/2310.09736v1
- Date: Sun, 15 Oct 2023 05:07:08 GMT
- Authors: Ni Putu Intan Maharani, Yoga Yustiawan, Fauzy Caesar Rochim, Ayu Purwarianti
- Abstract summary: BERT and IndoBERT have achieved impressive performance in several NLP tasks.
We focus on the financial domain and the Indonesian language, performing post-training on pre-trained IndoBERT for the financial domain.
- Score: 1.8377013498056056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: BERT and IndoBERT have achieved impressive performance in several NLP tasks.
There have been several investigations of their adaptation to specialized domains,
especially for the English language. We focus on the financial domain and the Indonesian
language, where we perform post-training on pre-trained IndoBERT for the financial
domain using a small-scale Indonesian financial corpus. In this paper, we
construct an Indonesian self-supervised financial corpus, an Indonesian financial
sentiment analysis dataset, and an Indonesian financial topic classification dataset,
and release a family of BERT models for financial NLP. We also evaluate the
effectiveness of domain-specific post-training on sentiment analysis and topic
classification tasks. Our findings indicate that post-training increases
the effectiveness of a language model when it is fine-tuned on domain-specific
downstream tasks.
Related papers
- Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages [55.963648108438555]
Large language models (LLMs) show remarkable human-like capability in various domains and languages.
We introduce Cendol, a collection of Indonesian LLMs encompassing both decoder-only and encoder-decoder architectures.
We highlight Cendol's effectiveness across a diverse array of tasks, attaining a 20% improvement, and demonstrate its capability to generalize.
arXiv Detail & Related papers (2024-04-09T09:04:30Z)
- No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks [73.11935193630823]
ICE-PIXIU uniquely integrates a spectrum of Chinese tasks, alongside translated and original English datasets.
It provides unrestricted access to diverse model variants, a compilation of diverse cross-lingual and multi-modal instruction data, and an evaluation benchmark with expert annotations.
arXiv Detail & Related papers (2024-03-10T16:22:20Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing [22.754757518792395]
FinLMEval is a framework for Financial Language Model Evaluation.
This study compares the performance of encoder-only language models and decoder-only language models.
arXiv Detail & Related papers (2023-10-19T11:43:15Z)
- Domain Adaptation for Arabic Machine Translation: The Case of Financial Texts [0.7673339435080445]
We develop a parallel corpus for Arabic-English (AR-EN) translation in the financial domain.
We fine-tune several NMT and Large Language models including ChatGPT-3.5 Turbo.
The quality of ChatGPT translation was superior to that of the other models based on automatic and human evaluations.
arXiv Detail & Related papers (2023-09-22T13:37:19Z)
- Removing Non-Stationary Knowledge From Pre-Trained Language Models for Entity-Level Sentiment Classification in Finance [0.0]
We build KorFinASC, a Korean aspect-level sentiment classification dataset for finance consisting of 12,613 human-annotated samples.
We use the term "non-stationary knowledge" to refer to information that was previously correct but is likely to change, and present "TGT-Masking", a novel masking pattern.
arXiv Detail & Related papers (2023-01-09T01:26:55Z)
- One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages.
We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z)
- FinEAS: Financial Embedding Analysis of Sentiment [0.0]
We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS).
In this work, we propose a new model for financial sentiment analysis based on supervised fine-tuned sentence embeddings from a standard BERT model.
arXiv Detail & Related papers (2021-10-31T15:41:56Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
- FinBERT: A Pretrained Language Model for Financial Communications [25.900063840368347]
There are no pretrained finance-specific language models available.
We address this need by pretraining a financial domain-specific BERT model, FinBERT, using large-scale financial communication corpora.
Experiments on three financial sentiment classification tasks confirm the advantage of FinBERT over the generic-domain BERT model.
arXiv Detail & Related papers (2020-06-15T02:51:06Z)
- DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis [71.40586258509394]
We propose DomBERT, an extension of BERT to learn from both in-domain corpus and relevant domain corpora.
Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.
arXiv Detail & Related papers (2020-04-28T21:07:32Z)