Do Moral Judgment and Reasoning Capability of LLMs Change with Language?
A Study using the Multilingual Defining Issues Test
- URL: http://arxiv.org/abs/2402.02135v1
- Date: Sat, 3 Feb 2024 12:52:36 GMT
- Title: Do Moral Judgment and Reasoning Capability of LLMs Change with Language?
A Study using the Multilingual Defining Issues Test
- Authors: Aditi Khandelwal, Utkarsh Agarwal, Kumar Tanmay, Monojit Choudhury
- Abstract summary: We extend prior work beyond English to five new languages (Chinese, Hindi, Russian, Spanish and Swahili).
Our study shows that the moral reasoning ability for all models, as indicated by the post-conventional score, is substantially inferior for Hindi and Swahili, compared to Spanish, Russian, Chinese and English.
- Score: 21.108525674360898
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test. It is a well-known fact that moral judgment depends on the language in which the question is asked. We extend prior work beyond English to five new languages (Chinese, Hindi, Russian, Spanish and Swahili) and probe three LLMs -- ChatGPT, GPT-4 and Llama2Chat-70B -- that show substantial multilingual text processing and generation abilities. Our study shows that the moral reasoning ability of all models, as indicated by the post-conventional score, is substantially inferior for Hindi and Swahili compared to Spanish, Russian, Chinese and English, while there is no clear trend among the latter four languages. The moral judgments, too, vary considerably by language.
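For context, the post-conventional score mentioned above comes from standard Defining Issues Test scoring: the respondent ranks the four most important considerations for a dilemma from a fixed item list, and the P-score is the share of ranking points that go to post-conventional (Kohlberg stage 5/6) items. The Python sketch below is a minimal illustration of that scheme, assuming the usual 4-3-2-1 rank weighting; the item ids and the example ranking are hypothetical, not taken from the paper.

```python
# Minimal sketch of Defining Issues Test (DIT) P-score computation, assuming
# the standard 4-3-2-1 weighting for the four top-ranked items per dilemma.
# Item ids and the example ranking below are hypothetical.

RANK_POINTS = {1: 4, 2: 3, 3: 2, 4: 1}

def p_score(dilemmas):
    """dilemmas: list of (ranked_items, post_conventional_items) pairs, where
    ranked_items maps a rank (1-4) to the item id placed at that rank and
    post_conventional_items is the set of stage-5/6 item ids."""
    earned, possible = 0, 0
    for ranked_items, post_conventional in dilemmas:
        for rank, item in ranked_items.items():
            if item in post_conventional:
                earned += RANK_POINTS[rank]
        possible += sum(RANK_POINTS.values())  # 10 points per dilemma
    return 100 * earned / possible

# One dilemma: the model ranked post-conventional items "i5" and "i9"
# first and third, so it earns 4 + 2 of the 10 available points.
ranking = {1: "i5", 2: "i2", 3: "i9", 4: "i11"}
print(p_score([(ranking, {"i5", "i9", "i12"})]))  # 60.0
```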
Related papers
- Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail? [2.9630910534509924]
We evaluate the performance of state-of-the-art LLMs on a recently released benchmark whose questions resemble those of Spanish exams for foreign students.
Results show that LLMs perform well at understanding Spanish but are still far from achieving the level of a native speaker in terms of grammatical competence.
arXiv Detail & Related papers (2024-09-08T11:30:03Z)
- Decoding Multilingual Moral Preferences: Unveiling LLM's Biases Through the Moral Machine Experiment [11.82100047858478]
This paper builds on the moral machine experiment (MME) to investigate the moral preferences of five large language models in a multilingual setting.
We generate 6500 MME scenarios and prompt the models in ten languages, asking which action to take.
Our analysis reveals that all LLMs exhibit moral biases to some degree, and that their preferences not only diverge from human preferences but also vary across languages within the same model.
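A minimal sketch of what such a multilingual probe can look like in practice is given below. The prompt templates, the scenario fields and the query_model() placeholder are all assumptions for illustration, not the setup used in the paper.

```python
# Illustrative sketch of prompting the same moral-machine-style scenario in
# several languages and collecting the model's chosen action per language.

TEMPLATES = {
    "en": "A self-driving car must choose: {option_a} or {option_b}. "
          "Which action should it take?",
    "es": "Un coche autónomo debe elegir: {option_a} o {option_b}. "
          "¿Qué acción debería tomar?",
}

def query_model(prompt: str) -> str:
    # Placeholder for a real LLM API call (e.g., a chat-completions request).
    return "option_a"

def collect_judgments(scenario: dict) -> dict:
    """Returns {language: model_answer} for one scenario."""
    return {lang: query_model(tmpl.format(**scenario))
            for lang, tmpl in TEMPLATES.items()}

print(collect_judgments({"option_a": "swerve", "option_b": "stay on course"}))
```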
arXiv Detail & Related papers (2024-07-21T14:48:13Z)
- Language Model Alignment in Multilingual Trolley Problems [138.5684081822807]
Building on the Moral Machine experiment, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP.
Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions.
We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems.
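As a rough illustration of per-language alignment measurement of this kind, the sketch below computes the fraction of dilemmas on which a model's choice matches the majority human preference, separately per language. The record format and the example data are hypothetical; this is not MultiTP's actual metric.

```python
# Toy per-language alignment score: the share of dilemmas where the model's
# choice agrees with the majority human preference. Data is invented.

from collections import defaultdict

def alignment_by_language(records):
    """records: iterable of (language, model_choice, human_choice) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lang, model_choice, human_choice in records:
        totals[lang] += 1
        hits[lang] += (model_choice == human_choice)
    return {lang: hits[lang] / totals[lang] for lang in totals}

records = [
    ("en", "save_pedestrians", "save_pedestrians"),
    ("en", "save_passengers", "save_pedestrians"),
    ("sw", "save_passengers", "save_pedestrians"),
]
print(alignment_by_language(records))  # {'en': 0.5, 'sw': 0.0}
```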
arXiv Detail & Related papers (2024-07-02T14:02:53Z)
- Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
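The few-shot mitigation mentioned above can be pictured as prepending demonstrations that answer in the user's language before the real query. The sketch below is one plausible construction, with invented demonstration pairs; it is not the paper's exact prompt format.

```python
# Hedged sketch of few-shot prompting against language confusion: seed the
# conversation with examples that answer in the language of the question.
# The demonstration pairs are invented for illustration.

FEW_SHOT = [
    ("¿Cuál es la capital de Francia?", "La capital de Francia es París."),
    ("Wie viele Planeten hat das Sonnensystem?",
     "Das Sonnensystem hat acht Planeten."),
]

def build_messages(user_query: str) -> list[dict]:
    messages = []
    for question, answer in FEW_SHOT:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": user_query})
    return messages

messages = build_messages("Quelle est la capitale de l'Allemagne ?")
print(len(messages))  # 5: two demonstrations (4 messages) plus the query
```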
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
- Ethical Reasoning and Moral Value Alignment of LLMs Depend on the Language we Prompt them in [19.675262411557235]
This paper explores how three prominent LLMs -- GPT-4, ChatGPT, and Llama2-70B-Chat -- perform ethical reasoning in different languages.
We experiment with six languages: English, Spanish, Russian, Chinese, Hindi, and Swahili.
We find that GPT-4 is the most consistent and unbiased ethical reasoner across languages, while ChatGPT and Llama2-70B-Chat show significant moral value bias when we move to languages other than English.
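To make "consistent across languages" concrete, one simple operationalization is a model's average pairwise agreement across languages on the same set of questions. The sketch below implements that toy metric with hypothetical verdicts; it is an illustration, not the paper's measure.

```python
# Toy cross-language consistency metric: mean pairwise agreement of one
# model's verdicts across languages, index-aligned by dilemma. Illustrative
# only; not the metric used in the paper.

from itertools import combinations

def cross_language_consistency(judgments: dict[str, list[str]]) -> float:
    """judgments maps a language code to the model's verdicts, so that
    judgments[lang][i] always refers to the same dilemma i."""
    scores = []
    for lang_a, lang_b in combinations(sorted(judgments), 2):
        pairs = list(zip(judgments[lang_a], judgments[lang_b]))
        scores.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(scores) / len(scores)

# Hypothetical verdicts for three dilemmas in three languages.
print(cross_language_consistency({
    "en": ["yes", "no", "yes"],
    "hi": ["yes", "yes", "yes"],
    "sw": ["no", "yes", "yes"],
}))  # mean of en-hi=2/3, en-sw=1/3, hi-sw=2/3 -> 0.5555...
```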
arXiv Detail & Related papers (2024-04-29T06:42:27Z)
- MLaKE: Multilingual Knowledge Editing Benchmark for Large Language Models [65.10456412127405]
MLaKE is a benchmark for assessing the adaptability of knowledge editing methods across five languages.
MLaKE aggregates fact chains from Wikipedia across languages and generates questions in both free-form and multiple-choice.
We evaluate the multilingual knowledge editing generalization capabilities of existing methods on MLaKE.
arXiv Detail & Related papers (2024-04-07T15:23:28Z)
- Speaking Multiple Languages Affects the Moral Bias of Language Models [70.94372902010232]
Pre-trained multilingual language models (PMLMs) are commonly used when dealing with data from multiple languages and cross-lingual transfer.
Do the models capture moral norms from English and impose them on other languages?
Our experiments demonstrate that, indeed, PMLMs encode differing moral biases, but these do not necessarily correspond to cultural differences or commonalities in human opinions.
arXiv Detail & Related papers (2022-11-14T20:08:54Z)
- Do Multilingual Language Models Capture Differing Moral Norms? [71.52261949766101]
Massively multilingual sentence representations are trained on large corpora of uncurated data.
This may cause the models to grasp cultural values including moral judgments from the high-resource languages.
The lack of data in certain languages can also lead to developing random and thus potentially harmful beliefs.
arXiv Detail & Related papers (2022-03-18T12:26:37Z)
- Cross-Lingual Ability of Multilingual Masked Language Models: A Study of Language Structure [54.01613740115601]
We study three language properties: constituent order, composition and word co-occurrence.
Our main conclusion is that the contribution of constituent order and word co-occurrence is limited, while composition is more crucial to the success of cross-lingual transfer.
arXiv Detail & Related papers (2022-03-16T07:09:35Z)