Leveraging Open-Source Large Language Models for Native Language Identification
- URL: http://arxiv.org/abs/2409.09659v1
- Date: Sun, 15 Sep 2024 08:14:18 GMT
- Title: Leveraging Open-Source Large Language Models for Native Language Identification
- Authors: Yee Man Ng, Ilia Markov
- Abstract summary: Native Language Identification (NLI) has applications in forensics, marketing, and second language acquisition.
This study explores the potential of using open-source generative large language models (LLMs) for NLI.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Native Language Identification (NLI) - the task of identifying the native language (L1) of a person based on their writing in the second language (L2) - has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that heavily rely on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and the undisclosed nature of their training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.
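As a rough illustration of the zero-shot setting described in the abstract, the sketch below prompts an open-source instruction-tuned LLM to guess a writer's L1 from an English text via the Hugging Face transformers pipeline. The model name, candidate L1 set, and prompt wording are illustrative assumptions, not details taken from the paper.

```python
# Minimal zero-shot NLI sketch (illustrative only, not the paper's code).
# Assumptions: an open-source instruction-tuned model and a small candidate L1 set.
from transformers import pipeline

CANDIDATE_L1S = ["German", "Spanish", "Chinese", "Arabic", "Turkish"]  # assumed label set

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # any open-source instruct model works here
)

def predict_native_language(l2_text: str) -> str:
    """Ask the model which candidate L1 the writer of the English (L2) text most likely has."""
    prompt = (
        "The following English text was written by a non-native speaker.\n"
        f"Text: {l2_text}\n"
        f"Which of these is most likely the writer's native language: {', '.join(CANDIDATE_L1S)}?\n"
        "Answer with one language name only.\n"
        "Answer:"
    )
    full_output = generator(prompt, max_new_tokens=10, do_sample=False)[0]["generated_text"]
    completion = full_output[len(prompt):]  # pipeline returns prompt + completion by default
    # Map the free-form completion back onto the candidate set.
    for lang in CANDIDATE_L1S:
        if lang.lower() in completion.lower():
            return lang
    return "unknown"

print(predict_native_language("I am agree that the big city life have many advantage."))
```

The fine-tuned setting the abstract contrasts with this would additionally train such a model on labeled L1 data (standard supervised fine-tuning) before prediction; that step is not shown here.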
Related papers
- Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization (arXiv, 2024-10-26)
Cross-lingual summarization (CLS) aims to generate a summary of a source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
- Open Llama2 Model for the Lithuanian Language (arXiv, 2024-08-23)
We propose and describe the first open Llama2 large language models (LLMs) for the Lithuanian language.
We provide a brief review of open regional LLMs and detailed information on the proposed LLMs and their training process.
- Learning-From-Mistakes Prompting for Indigenous Language Translation (arXiv, 2024-07-18)
This paper presents techniques to improve extremely low-resourced indigenous language translations.
Our approaches are grounded in the use of a datastore consisting of a limited number of parallel translation examples.
We harness LLMs and in-context learning techniques in this setting to use them as universal translators.
- MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series (arXiv, 2024-05-29)
We open-source MAP-Neo, a bilingual language model with 7B parameters trained from scratch on 4.5T high-quality tokens.
Our MAP-Neo is the first fully open-sourced bilingual LLM with performance comparable to existing state-of-the-art LLMs.
- MindMerger: Efficient Boosting LLM Reasoning in non-English Languages (arXiv, 2024-05-27)
Reasoning capabilities are crucial for Large Language Models (LLMs).
We propose MindMerger, which merges LLMs with the external language understanding capabilities from multilingual models.
MindMerger consistently outperforms all baselines, especially in low-resource languages.
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages (arXiv, 2024-02-19)
Large language models (LLMs) have been pre-trained on multilingual corpora.
However, their performance in most languages still lags behind that in a few resource-rich languages.
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset (arXiv, 2024-02-07)
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
- Supervised Knowledge Makes Large Language Models Better In-context Learners (arXiv, 2023-12-26)
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes to out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
- Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts (arXiv, 2023-06-20)
Large language models (LLMs) are known to effectively perform tasks by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
This list is automatically generated from the titles and abstracts of the papers listed on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.