ClimateBert: A Pretrained Language Model for Climate-Related Text
- URL: http://arxiv.org/abs/2110.12010v1
- Date: Fri, 22 Oct 2021 18:47:34 GMT
- Title: ClimateBert: A Pretrained Language Model for Climate-Related Text
- Authors: Nicolas Webersinke, Mathias Kraus, Julia Anna Bingler, Markus Leippold
- Abstract summary: Large pretrained language models (LMs) have revolutionized the field of natural language processing (NLP).
We propose ClimateBert, a transformer-based language model that is further pretrained on over 1.6 million paragraphs of climate-related texts.
We find that ClimateBert leads to a 46% improvement on a masked language model objective, which in turn lowers error rates by 3.57% to 35.71% on various climate-related downstream tasks.
- Score: 6.9637233646722985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over recent years, large pretrained language models (LMs) have
revolutionized the field of natural language processing (NLP). However, while
pretraining on general language has been shown to work very well for common
language, niche language has been observed to pose problems. In particular,
climate-related texts include domain-specific language that common LMs cannot
represent accurately. We argue that this shortcoming of today's LMs limits
the applicability of modern NLP to the broad field of text processing of
climate-related texts. As a remedy, we propose ClimateBert, a transformer-based
language model that is further pretrained on over 1.6 million paragraphs of
climate-related texts, crawled from various sources such as common news,
research articles, and climate reporting of companies. We find that
ClimateBert leads to a 46% improvement on a masked language model objective
which, in turn, leads to lowering error rates by 3.57% to 35.71% for various
climate-related downstream tasks like text classification, sentiment analysis,
and fact-checking.
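To make the further-pretraining step concrete, the following is a minimal sketch of domain-adaptive masked-language-model training with the Hugging Face transformers library. The base checkpoint (distilroberta-base), the file name climate_paragraphs.txt, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Sketch of further (domain-adaptive) pretraining with a masked-LM objective.
# Assumptions not taken from the paper: base checkpoint, file name, hyperparameters.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

# One climate-related paragraph per line in a plain-text file.
dataset = load_dataset("text", data_files={"train": "climate_paragraphs.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of the tokens are masked on the fly in each batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="climate-mlm",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

After this step, the adapted checkpoint is fine-tuned in the usual way (e.g. with a sequence-classification head) for the downstream tasks the abstract mentions: text classification, sentiment analysis, and fact-checking.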
Related papers
- Since the Scientific Literature Is Multilingual, Our Models Should Be Too [8.039428445336364]
We show that the literature is largely multilingual and argue that current models and benchmarks should reflect this linguistic diversity.
We provide evidence that text-based models fail to create meaningful representations for non-English papers and highlight the negative user-facing impacts of using English-only models indiscriminately across a multilingual domain.
arXiv Detail & Related papers (2024-03-27T04:47:10Z)
- ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on Climate Change [21.827936253363603]
This paper introduces ClimateGPT, a model family of domain-specific large language models that synthesize interdisciplinary research on climate change.
We trained two 7B models from scratch on a science-oriented dataset of 300B tokens.
ClimateGPT-7B, 13B and 70B are continuously pre-trained from Llama2 on a domain-specific dataset of 4.2B tokens.
arXiv Detail & Related papers (2024-01-17T23:29:46Z)
- Arabic Mini-ClimateGPT: A Climate Change and Sustainability Tailored Arabic LLM [77.17254959695218]
Large Language Models (LLMs) like ChatGPT and Bard have shown impressive conversational abilities and excel in a wide variety of NLP tasks.
We propose a lightweight Arabic Mini-ClimateGPT that is built on an open-source LLM and fine-tuned on Clima500-Instruct, a conversational-style Arabic instruction-tuning dataset.
Our model surpasses the baseline LLM in 88.3% of cases during ChatGPT-based evaluation.
arXiv Detail & Related papers (2023-12-14T22:04:07Z)
- Cross-Lingual Knowledge Editing in Large Language Models [73.12622532088564]
Knowledge editing has been shown to adapt large language models to new knowledge without retraining from scratch.
However, the effect of editing in a source language on a different target language remains unknown.
We first collect a large-scale cross-lingual synthetic dataset by translating ZsRE from English to Chinese.
arXiv Detail & Related papers (2023-09-16T11:07:52Z)
- Enhancing Large Language Models with Climate Resources [5.2677629053588895]
Large language models (LLMs) have transformed the landscape of artificial intelligence by demonstrating their ability to generate human-like text.
However, they often employ imprecise language, which can be detrimental in domains where accuracy is crucial, such as climate change.
In this study, we make use of recent ideas to harness the potential of LLMs by viewing them as agents that access multiple sources.
We demonstrate the effectiveness of our method through a prototype agent that retrieves emission data from ClimateWatch.
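The agent-with-tools idea summarized above can be illustrated with a small sketch: the language model delegates factual lookups to a retrieval function instead of answering from its parameters. Everything below is an assumption for illustration only, not the authors' implementation: the local CSV export (e.g. data downloaded from ClimateWatch), its column names, and the function names are hypothetical.

```python
# Hypothetical sketch of a retrieval "tool" an LLM agent could call instead of
# guessing emission figures. Assumptions: emissions.csv is a local export
# (e.g. from ClimateWatch) with columns `country`, `year`, `value`; none of
# these names come from the paper.
import csv
from typing import Optional

def lookup_emissions(path: str, country: str, year: int) -> Optional[float]:
    """Return the recorded emission value for a country/year pair, if present."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["country"] == country and int(row["year"]) == year:
                return float(row["value"])
    return None

def grounded_answer(country: str, year: int, path: str = "emissions.csv") -> str:
    """Toy agent step: fetch the figure, then phrase the answer around it."""
    value = lookup_emissions(path, country, year)
    if value is None:
        return f"No emission record found for {country} in {year}."
    return f"Retrieved from the local dataset: {country}, {year} -> {value}."
```

The point of the sketch is only that the numeric fact comes from the retrieved source rather than from the model's parameters; in the paper's setting the retrieval targets ClimateWatch and the surrounding text is generated by the LLM.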
arXiv Detail & Related papers (2023-03-31T20:24:14Z)
- Language Model Behavior: A Comprehensive Survey [5.663056267168211]
We discuss over 250 recent studies of English language model behavior before task-specific fine-tuning.
Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to unfactual responses, commonsense errors, memorized text, and social biases.
arXiv Detail & Related papers (2023-03-20T23:54:26Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages [75.08199398141744]
We present AmericasNLI, an extension of XNLI (Conneau et al.) to 10 indigenous languages of the Americas.
We conduct experiments with XLM-R, testing multiple zero-shot and translation-based approaches.
We find that XLM-R's zero-shot performance is poor for all 10 languages, with an average performance of 38.62%.
arXiv Detail & Related papers (2021-04-18T05:32:28Z)
- Analyzing Sustainability Reports Using Natural Language Processing [68.8204255655161]
In recent years, companies have increasingly aimed to both mitigate their environmental impact and adapt to the changing climate context.
These efforts are documented in increasingly exhaustive reports, which cover many types of climate risks and exposures under the umbrella of Environmental, Social, and Governance (ESG).
This article presents a tool for analyzing such reports and the methodology used to develop it.
arXiv Detail & Related papers (2020-11-03T21:22:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.