A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias
- URL: http://arxiv.org/abs/2404.00929v2
- Date: Thu, 6 Jun 2024 16:04:15 GMT
- Title: A Survey on Multilingual Large Language Models: Corpora, Alignment, and Bias
- Authors: Yuemei Xu, Ling Hu, Jiayi Zhao, Zihan Qiu, Yuqi Ye, Hanwen Gu,
- Abstract summary: We present an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities.
We explore widely utilized multilingual corpora for MLLMs' training and multilingual datasets oriented for downstream tasks.
We discuss bias on MLLMs including its category and evaluation metrics, and summarize the existing debiasing techniques.
- Score: 5.104497013562654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Based on the foundation of Large Language Models (LLMs), Multilingual Large Language Models (MLLMs) have been developed to address the challenges of multilingual natural language processing tasks, hoping to achieve knowledge transfer from high-resource to low-resource languages. However, significant limitations and challenges still exist, such as language imbalance, multilingual alignment, and inherent bias. In this paper, we aim to provide a comprehensive analysis of MLLMs, delving deeply into discussions surrounding these critical issues. First of all, we start by presenting an overview of MLLMs, covering their evolution, key techniques, and multilingual capacities. Secondly, we explore widely utilized multilingual corpora for MLLMs' training and multilingual datasets oriented for downstream tasks that are crucial for enhancing the cross-lingual capability of MLLMs. Thirdly, we survey the existing studies on multilingual representations and investigate whether the current MLLMs can learn a universal language representation. Fourthly, we discuss bias on MLLMs including its category and evaluation metrics, and summarize the existing debiasing techniques. Finally, we discuss existing challenges and point out promising research directions. By demonstrating these aspects, this paper aims to facilitate a deeper understanding of MLLMs and their potentiality in various domains.
Related papers
- Multilingual Large Language Models: A Systematic Survey [38.972546467173565]
This paper provides a comprehensive survey of the latest research on multilingual large language models (MLLMs)
We first discuss the architecture and pre-training objectives of MLLMs, highlighting the key components and methodologies that contribute to their multilingual capabilities.
We present a detailed taxonomy and roadmap covering the assessment of MLLMs' cross-lingual knowledge, reasoning, alignment with human values, safety, interpretability and specialized applications.
arXiv Detail & Related papers (2024-11-17T13:21:26Z) - LLM for Everyone: Representing the Underrepresented in Large Language Models [21.07409393578553]
This thesis aims to bridge the gap in NLP research and development by focusing on underrepresented languages.
A comprehensive evaluation of large language models (LLMs) is conducted to assess their capabilities in these languages.
The proposed solutions cover cross-lingual continual instruction tuning, retrieval-based cross-lingual in-context learning, and in-context query alignment.
arXiv Detail & Related papers (2024-09-20T20:53:22Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.
But can these models relate corresponding concepts across languages, effectively being crosslingual?
This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z) - Towards Truthful Multilingual Large Language Models: Benchmarking and Alignment Strategies [38.3269908062146]
We construct a benchmark for truthfulness evaluation in multilingual scenarios.
We propose Fact-aware Multilingual Selective Synergy (FaMSS) to optimize the data allocation across a large number of languages.
arXiv Detail & Related papers (2024-06-20T15:59:07Z) - A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [48.314619377988436]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing.
Despite the breakthroughs of LLMs, the investigation into the multilingual scenario remains insufficient.
This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z) - Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers [81.47046536073682]
We present a review and provide a unified perspective to summarize the recent progress as well as emerging trends in multilingual large language models (MLLMs) literature.
We hope our work can provide the community with quick access and spur breakthrough research in MLLMs.
arXiv Detail & Related papers (2024-04-07T11:52:44Z) - Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models [117.20416338476856]
Large language models (LLMs) demonstrate remarkable multilingual capabilities without being pre-trained on specially curated multilingual parallel corpora.
We propose a novel detection method, language activation probability entropy (LAPE), to identify language-specific neurons within LLMs.
Our findings indicate that LLMs' proficiency in processing a particular language is predominantly due to a small subset of neurons.
arXiv Detail & Related papers (2024-02-26T09:36:05Z) - A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot.
This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z) - A Primer on Pretrained Multilingual Language Models [18.943173499882885]
Multilingual Language Models (MLLMs) have emerged as a viable option for bringing the power of pretraining to a large number of languages.
We review the existing literature covering the above broad areas of research pertaining to MLLMs.
arXiv Detail & Related papers (2021-07-01T18:01:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.