Related papers: Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish

Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish

URL: http://arxiv.org/abs/2403.00411v2
Date: Fri, 22 Mar 2024 15:54:03 GMT
Title: Cross-Lingual Learning vs. Low-Resource Fine-Tuning: A Case Study with Fact-Checking in Turkish
Authors: Recep Firat Cekinel, Pinar Karagoz, Cagri Coltekin,
Abstract summary: We introduce the FCTR dataset, consisting of 3238 real-world claims. This dataset spans multiple domains and incorporates evidence collected from three Turkish fact-checking organizations.
Score: 0.9217021281095907
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The rapid spread of misinformation through social media platforms has raised concerns regarding its impact on public opinion. While misinformation is prevalent in other languages, the majority of research in this field has concentrated on the English language. Hence, there is a scarcity of datasets for other languages, including Turkish. To address this concern, we have introduced the FCTR dataset, consisting of 3238 real-world claims. This dataset spans multiple domains and incorporates evidence collected from three Turkish fact-checking organizations. Additionally, we aim to assess the effectiveness of cross-lingual transfer learning for low-resource languages, with a particular focus on Turkish. We demonstrate in-context learning (zero-shot and few-shot) performance of large language models in this context. The experimental results indicate that the dataset has the potential to advance research in the Turkish language.

Related papers

Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish [9.111556632499472]
Cetvel is a benchmark designed to evaluate large language models (LLMs) in Turkish.<n>It combines a broad range of both discriminative and generative tasks ensuring content that reflects the linguistic and cultural richness of Turkish language.
arXiv Detail & Related papers (2025-08-22T14:42:50Z)
SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods [1.2091341579150698]
We release datasets of sentences containing polysemous words across ten low-resource languages.<n>To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method.<n>Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation.
arXiv Detail & Related papers (2025-05-29T17:48:08Z)
Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation [38.81102126876936]
This paper introduces a novel retrieval-based method that enhances translation quality for low-resource languages by focusing on key terms. To evaluate the effectiveness of this method, we conducted experiments translating from English into three low-resource languages: Cherokee, a critically endangered indigenous language of North America; Tibetan, a historically and culturally significant language in Asia; and Manchu, a language with few remaining speakers. Our comparison with the zero-shot performance of GPT-4o and LLaMA 3.1 405B, highlights the significant challenges these models face when translating into low-resource languages.
arXiv Detail & Related papers (2024-11-18T05:41:27Z)
A Comparative Study of Translation Bias and Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification [1.566834021297545]
This study systematically evaluates translation bias and the effectiveness of Large Language Models for cross-lingual claim verification. We investigate two distinct translation methods: pre-translation and self-translation. Our findings reveal that low-resource languages exhibit significantly lower accuracy in direct inference due to underrepresentation.
arXiv Detail & Related papers (2024-10-14T09:02:42Z)
Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
LLMs Are Few-Shot In-Context Low-Resource Language Learners [59.74451570590808]
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages. We extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs.
arXiv Detail & Related papers (2024-03-25T07:55:29Z)
Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Developing and Evaluating Tiny to Medium-Sized Turkish BERT Models [0.0]
This study introduces and evaluates tiny, mini, small, and medium-sized uncased Turkish BERT models. We trained these models on a diverse dataset encompassing over 75GB of text from multiple sources and tested them on several tasks, including mask prediction, sentiment analysis, news classification, and, zero-shot classification.
arXiv Detail & Related papers (2023-07-26T12:02:30Z)
Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data. We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information. With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
CONCRETE: Improving Cross-lingual Fact-checking with Cross-lingual Retrieval [73.48591773882052]
Most fact-checking approaches focus on English only due to the data scarcity issue in other languages. We present the first fact-checking framework augmented with crosslingual retrieval. We train the retriever with our proposed Crosslingual Inverse Cloze Task (XICT)
arXiv Detail & Related papers (2022-09-05T17:36:14Z)
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
Data and Representation for Turkish Natural Language Inference [6.135815931215188]
We offer a positive response for natural language inference (NLI) in Turkish. We translate two large English NLI datasets into Turkish and had a team of experts validate their translation quality and fidelity to the original labels. We find that in-language embeddings are essential and that morphological parsing can be avoided where the training set is large.
arXiv Detail & Related papers (2020-04-30T17:12:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.