Detecting ESG topics using domain-specific language models and data
augmentation approaches
- URL: http://arxiv.org/abs/2010.08319v1
- Date: Fri, 16 Oct 2020 11:20:07 GMT
- Title: Detecting ESG topics using domain-specific language models and data
augmentation approaches
- Authors: Tim Nugent, Nicole Stelea and Jochen L. Leidner
- Abstract summary: Natural language processing tasks in the financial domain remain challenging due to paucity of appropriately labelled data.
Here, we investigate two approaches that may help to mitigate these issues.
Firstly, we experiment with further language model pre-training using large amounts of in-domain data from business and financial news.
We then apply augmentation approaches to increase the size of our dataset for model fine-tuning.
- Score: 3.3332986505989446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in deep learning-based language modelling, many
natural language processing (NLP) tasks in the financial domain remain
challenging due to the paucity of appropriately labelled data. Other issues
that can limit task performance are differences in word distribution between
the general corpora - typically used to pre-train language models - and
financial corpora, which often exhibit specialized language and symbology.
Here, we investigate two approaches that may help to mitigate these issues.
Firstly, we experiment with further language model pre-training using large
amounts of in-domain data from business and financial news. We then apply
augmentation approaches to increase the size of our dataset for model
fine-tuning. We report our findings on an Environmental, Social and Governance
(ESG) controversies dataset and demonstrate that both approaches are beneficial
to accuracy in classification tasks.
Related papers
- Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution [7.681258910515419]
Tabular data presents unique challenges due to its heterogeneous nature and complex structural relationships.
High predictive performance and robustness in tabular data analysis holds significant promise for numerous applications.
The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning.
arXiv Detail & Related papers (2024-08-20T04:59:19Z) - We're Calling an Intervention: Exploring the Fundamental Hurdles in Adapting Language Models to Nonstandard Text [8.956635443376527]
We present a suite of experiments that allow us to understand the underlying challenges of language model adaptation to nonstandard text.
We do so by designing interventions that approximate several types of linguistic variation and their interactions with existing biases of language models.
Applying our interventions during language model adaptation with varying size and nature of training data, we gain important insights into when knowledge transfer can be successful.
arXiv Detail & Related papers (2024-04-10T18:56:53Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts)
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Model-Agnostic Meta-Learning for Natural Language Understanding Tasks in
Finance [1.863067234952186]
We investigate model-agnostic meta-learning algorithm(MAML) in low-resource financial NLU tasks.
Our models achieve the state-of-the-art performance according to the experimental results.
arXiv Detail & Related papers (2023-03-06T02:24:48Z) - Effective Transfer Learning for Low-Resource Natural Language
Understanding [15.752309656576129]
We focus on developing cross-lingual and cross-domain methods to tackle the low-resource issues.
First, we propose to improve the model's cross-lingual ability by focusing on the task-related keywords.
Second, we present Order-Reduced Modeling methods for the cross-lingual adaptation.
Third, we propose to leverage different levels of domain-related corpora and additional masking of data in the pre-training for the cross-domain adaptation.
arXiv Detail & Related papers (2022-08-19T06:59:00Z) - Improving Classifier Training Efficiency for Automatic Cyberbullying
Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z) - Data Augmentation for Spoken Language Understanding via Pretrained
Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z) - Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.