Arabic Mini-ClimateGPT : A Climate Change and Sustainability Tailored
Arabic LLM
- URL: http://arxiv.org/abs/2312.09366v1
- Date: Thu, 14 Dec 2023 22:04:07 GMT
- Title: Arabic Mini-ClimateGPT : A Climate Change and Sustainability Tailored
Arabic LLM
- Authors: Sahal Shaji Mullappilly, Abdelrahman Shaker, Omkar Thawakar, Hisham
Cholakkal, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan
- Abstract summary: Large Language Models (LLMs) like ChatGPT and Bard have shown impressive conversational abilities and excel in a wide variety of NLP tasks.
We propose a lightweight Arabic Mini-ClimateGPT that is built on an open-source LLM and is specifically fine-tuned on Clima500-Instruct, a conversational-style Arabic instruction-tuning dataset.
Our model surpasses the baseline LLM in 88.3% of cases during ChatGPT-based evaluation.
- Score: 77.17254959695218
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Climate change is one of the most significant challenges we face together as
a society. Creating awareness and educating policy makers about the
wide-ranging impact of climate change is an essential step towards a
sustainable future.
Recently, Large Language Models (LLMs) like ChatGPT and Bard have shown
impressive conversational abilities and excel in a wide variety of NLP tasks.
While these models are closed-source, open-source alternatives such as
Stanford Alpaca and Vicuna have recently shown promising results. However,
these open-source models are not specifically tailored for climate-related,
domain-specific information and also struggle to generate meaningful responses
in other languages, such as Arabic. To this end, we propose a lightweight
Arabic Mini-ClimateGPT that is built on an open-source LLM and is specifically
fine-tuned on Clima500-Instruct, a curated conversational-style Arabic
instruction-tuning dataset with over 500k instructions about climate change
and sustainability. Further, our model utilizes a vector-embedding-based
retrieval mechanism during inference. We validate the proposed model through
quantitative and qualitative evaluations on climate-related queries. Our model
surpasses the baseline LLM in 88.3% of cases during ChatGPT-based evaluation.
Furthermore, our human expert evaluation reveals an 81.6% preference for our
model's responses over multiple popular open-source models. Our open-source
demos, codebase, and models are available at
https://github.com/mbzuai-oryx/ClimateGPT.
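As a rough illustration of the vector-embedding-based retrieval mentioned in the abstract, the sketch below retrieves the passages most similar to a query and prepends them to the prompt as grounding context. The embedding model, the toy passage store, and all names here are illustrative assumptions, not the authors' implementation; the actual code lives in the repository linked above.

```python
# Minimal sketch of inference-time vector-embedding retrieval (assumptions:
# sentence-transformers for embeddings, an in-memory passage list; the
# paper's actual embedding model and document store may differ).
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy stand-in for a curated climate knowledge base.
passages = [
    "Rising sea levels threaten coastal communities worldwide.",
    "Renewable energy adoption reduces greenhouse gas emissions.",
    "Deforestation accelerates the loss of natural carbon sinks.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
passage_vecs = embedder.encode(passages, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k passages most similar to the query by cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = passage_vecs @ q  # dot product of unit vectors = cosine similarity
    return [passages[i] for i in np.argsort(-scores)[:k]]

# The retrieved context is prepended to the user query before it reaches
# the fine-tuned LLM, grounding the generated answer.
question = "How do renewables affect emissions?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)
```

With a multilingual embedding model, the same pattern extends to Arabic queries and passages.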
Related papers
- GemmAr: Enhancing LLMs Through Arabic Instruction-Tuning [0.0]
We introduce InstAr-500k, a new Arabic instruction dataset created by generating and collecting content.
We assess this dataset by fine-tuning an open-source Gemma-7B model on several downstream tasks to improve its functionality.
Based on multiple evaluations, our fine-tuned model achieves excellent performance on several Arabic NLP benchmarks.
arXiv Detail & Related papers (2024-07-02T10:43:49Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - ClimateGPT: Towards AI Synthesizing Interdisciplinary Research on
Climate Change [21.827936253363603]
This paper introduces ClimateGPT, a family of domain-specific large language models that synthesize interdisciplinary research on climate change.
We trained two 7B models from scratch on a science-oriented dataset of 300B tokens.
ClimateGPT-7B, 13B and 70B are continually pre-trained from Llama 2 on a domain-specific dataset of 4.2B tokens.
arXiv Detail & Related papers (2024-01-17T23:29:46Z) - YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z) - ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning [26.151056828513962]
Climate models have been key for assessing the impact of climate change and simulating future climate scenarios.
The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on tasks such as climate model emulation, downscaling, and prediction.
Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives.
arXiv Detail & Related papers (2023-11-07T04:55:36Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - Enhancing Large Language Models with Climate Resources [5.2677629053588895]
Large language models (LLMs) have transformed the landscape of artificial intelligence by demonstrating their ability to generate human-like text.
However, they often employ imprecise language, which can be detrimental in domains where accuracy is crucial, such as climate change.
In this study, we make use of recent ideas to harness the potential of LLMs by viewing them as agents that access multiple sources.
We demonstrate the effectiveness of our method through a prototype agent that retrieves emission data from ClimateWatch.
arXiv Detail & Related papers (2023-03-31T20:24:14Z) - ClimateBert: A Pretrained Language Model for Climate-Related Text [6.9637233646722985]
Large pretrained language models (LMs) have revolutionized the field of natural language processing (NLP).
We propose ClimateBert, a transformer-based language model that is further pretrained on over 1.6 million paragraphs of climate-related texts.
We find that ClimateBert leads to a 46% improvement on a masked language model objective which, in turn, lowers error rates by 3.57% to 35.71% on various climate-related downstream tasks.
arXiv Detail & Related papers (2021-10-22T18:47:34Z) - Analyzing Sustainability Reports Using Natural Language Processing [68.8204255655161]
In recent years, companies have increasingly been aiming to both mitigate their environmental impact and adapt to the changing climate context.
This is documented in increasingly exhaustive reports, which cover many types of climate risks and exposures under the umbrella of Environmental, Social, and Governance (ESG).
In this article, we present a tool for analyzing these reports with NLP and the methodology we used to develop it.
arXiv Detail & Related papers (2020-11-03T21:22:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.