Related papers: Domain-Specific Language Model Post-Training for Indonesian Financial NLP

URL: http://arxiv.org/abs/2310.09736v1
Date: Sun, 15 Oct 2023 05:07:08 GMT
Title: Domain-Specific Language Model Post-Training for Indonesian Financial NLP
Authors: Ni Putu Intan Maharani, Yoga Yustiawan, Fauzy Caesar Rochim, Ayu Purwarianti
Abstract summary: BERT and IndoBERT have achieved impressive performance in several NLP tasks. We focus on the financial domain and the Indonesian language, post-training the pre-trained IndoBERT for the financial domain.
Score: 1.8377013498056056
License: http://creativecommons.org/licenses/by/4.0/
Abstract: BERT and IndoBERT have achieved impressive performance in several NLP tasks. There have been several investigations into their adaptation to specialized domains, especially for the English language. We focus on the financial domain and the Indonesian language, post-training the pre-trained IndoBERT for the financial domain on a small-scale Indonesian financial corpus. In this paper, we construct an Indonesian self-supervised financial corpus, an Indonesian financial sentiment analysis dataset, and an Indonesian financial topic classification dataset, and release a family of BERT models for financial NLP. We also evaluate the effectiveness of domain-specific post-training on sentiment analysis and topic classification tasks. Our findings indicate that post-training increases the effectiveness of a language model when it is fine-tuned on domain-specific downstream tasks.

Related papers

Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities. We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets. We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z)
Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts. Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models. The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
No Language is an Island: Unifying Chinese and English in Financial Large Language Models, Instruction Data, and Benchmarks [75.29561463156635]
ICE-PIXIU uniquely integrates a spectrum of Chinese tasks, alongside translated and original English datasets. It provides unrestricted access to diverse model variants, a compilation of diverse cross-lingual and multi-modal instruction data, and an evaluation benchmark with expert annotations.
arXiv Detail & Related papers (2024-03-10T16:22:20Z)
Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing [22.754757518792395]
FinLMEval is a framework for Financial Language Model Evaluation. This study compares the performance of encoder-only and decoder-only language models.
arXiv Detail & Related papers (2023-10-19T11:43:15Z)
Domain Adaptation for Arabic Machine Translation: The Case of Financial Texts [0.7673339435080445]
We develop a parallel corpus for Arabic-English (AR-EN) translation in the financial domain. We fine-tune several NMT and large language models, including ChatGPT-3.5 Turbo. The quality of the ChatGPT translations was superior to that of the other models, based on automatic and human evaluations.
arXiv Detail & Related papers (2023-09-22T13:37:19Z)
Removing Non-Stationary Knowledge From Pre-Trained Language Models for Entity-Level Sentiment Classification in Finance [0.0]
We build KorFinASC, a Korean aspect-level sentiment classification dataset for finance consisting of 12,613 human-annotated samples. We use the term "non-stationary knowledge" to refer to information that was previously correct but is likely to change, and present "TGT-Masking", a novel masking pattern.
arXiv Detail & Related papers (2023-01-09T01:26:55Z)
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [60.87739250251769]
We provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems.
arXiv Detail & Related papers (2022-03-24T22:07:22Z)
FinEAS: Financial Embedding Analysis of Sentiment [0.0]
We introduce a new language representation model in finance called Financial Embedding Analysis of Sentiment (FinEAS). In this work, we propose a new model for financial sentiment analysis based on supervised fine-tuned sentence embeddings from a standard BERT model.
arXiv Detail & Related papers (2021-10-31T15:41:56Z)
Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
FinBERT: A Pretrained Language Model for Financial Communications [25.900063840368347]
There are no pretrained finance-specific language models available. We address this need by pretraining a financial domain-specific BERT model, FinBERT, on large-scale financial communication corpora. Experiments on three financial sentiment classification tasks confirm the advantage of FinBERT over the generic-domain BERT model.
arXiv Detail & Related papers (2020-06-15T02:51:06Z)
DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis [71.40586258509394]
We propose DomBERT, an extension of BERT that learns from both an in-domain corpus and relevant domain corpora. Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.
arXiv Detail & Related papers (2020-04-28T21:07:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.