Language-Agnostic Modeling of Source Reliability on Wikipedia
- URL: http://arxiv.org/abs/2410.18803v1
- Date: Thu, 24 Oct 2024 14:52:21 GMT
- Title: Language-Agnostic Modeling of Source Reliability on Wikipedia
- Authors: Jacopo D'Ignazi, Andreas Kaltenbrunner, Yelena Mejova, Michele Tizzani, Kyriaki Kalimeri, Mariano Beiró, Pablo Aragón
- Abstract summary: We present a language-agnostic model designed to assess the reliability of sources across multiple language editions of Wikipedia.
The model effectively predicts source reliability, achieving an F1 Macro score of approximately 0.80 for English.
We highlight the challenge of maintaining consistent model performance across languages of varying resource levels.
- Abstract: Over the last few years, content verification through reliable sources has become a fundamental need in combating disinformation. Here, we present a language-agnostic model designed to assess the reliability of sources across multiple language editions of Wikipedia. Utilizing editorial activity data, the model evaluates source reliability across articles on topics of varying controversiality, such as Climate Change, COVID-19, History, Media, and Biology. Crafting features that express domain usage across articles, the model effectively predicts source reliability, achieving an F1 Macro score of approximately 0.80 for English and other high-resource languages. For mid-resource languages we achieve 0.65, while the performance on low-resource languages varies; in all cases, the time a domain remains present in the articles (which we dub permanence) is one of the most predictive features. We highlight the challenge of maintaining consistent model performance across languages of varying resource levels and demonstrate that adapting models from higher-resource languages can improve performance. This work contributes not only to Wikipedia's efforts to ensure content verifiability but also to ensuring the reliability of diverse user-generated content across various language communities.
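The permanence feature described in the abstract (how long a source domain remains present in an article) could be sketched roughly as below. The function, its inputs, and the normalization are illustrative assumptions, not the authors' implementation.

```python
from datetime import datetime

def permanence(revisions, domain):
    """Fraction of an article's observed lifetime during which a
    source domain appears in its references.

    `revisions` is a chronologically sorted list of
    (timestamp, set_of_domains) pairs. The exact definition used in
    the paper may differ; this is an illustrative sketch.
    """
    if len(revisions) < 2:
        return 0.0
    total = (revisions[-1][0] - revisions[0][0]).total_seconds()
    if total == 0:
        return 0.0
    present = 0.0
    # A domain present at revision i is counted as present until the
    # next revision replaces the reference list.
    for (t0, domains), (t1, _) in zip(revisions, revisions[1:]):
        if domain in domains:
            present += (t1 - t0).total_seconds()
    return present / total

# Hypothetical revision history: example.org is cited for the first
# half of 2023, then removed.
revs = [
    (datetime(2023, 1, 1), {"example.org"}),
    (datetime(2023, 7, 1), {"other.net"}),
    (datetime(2024, 1, 1), {"other.net"}),
]
print(permanence(revs, "example.org"))  # present 181 of 365 days, ~0.496
```

Under this sketch, a domain repeatedly removed by editors accumulates little presence time and scores low, which matches the intuition of permanence as a reliability signal.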
Related papers
- Language-Agnostic Modeling of Wikipedia Articles for Content Quality Assessment across Languages
We propose a novel computational framework for modeling the quality of Wikipedia articles.
Our framework is based on language-agnostic structural features extracted from the articles.
We have built datasets with the feature values and quality scores of all revisions of all articles in the existing language versions of Wikipedia.
arXiv Detail & Related papers (2024-04-15T13:07:31Z)
- Cross-lingual Transfer Learning for Javanese Dependency Parsing
This study focuses on assessing the efficacy of transfer learning in enhancing dependency parsing for Javanese.
We utilize the Universal Dependencies dataset consisting of dependency treebanks from more than 100 languages, including Javanese.
arXiv Detail & Related papers (2024-01-22T16:13:45Z)
- GlotLID: Language Identification for Low-Resource Languages
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- A Comparative Study of Reference Reliability in Multiple Language Editions of Wikipedia
This study examines over 5 million Wikipedia articles to assess the reliability of references in multiple language editions.
Some sources deemed untrustworthy in one language (i.e., English) continue to appear in articles in other languages.
Non-authoritative sources found in the English version of a page tend to persist in other language versions of that page.
arXiv Detail & Related papers (2023-09-01T01:19:59Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- An Open Dataset and Model for Language Identification
We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033 across 201 languages.
We make both the model and the dataset available to the research community.
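The macro-averaged F1 reported here (and in the source-reliability paper above) gives every class equal weight regardless of frequency, so performance on rare classes matters as much as on common ones. A minimal sketch of the computation, with made-up counts:

```python
def macro_f1(per_class_counts):
    """per_class_counts: list of (tp, fp, fn) tuples, one per class.
    Macro-averaging takes the unweighted mean of per-class F1 scores."""
    f1s = []
    for tp, fp, fn in per_class_counts:
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Two classes: one common and well predicted, one rare and poorly
# predicted. The rare class drags the macro score down sharply.
print(macro_f1([(90, 10, 10), (1, 1, 3)]))  # (0.9 + 1/3) / 2 ~= 0.617
```

This equal weighting is why scores drop for mid- and low-resource settings: a model cannot hide weak minority-class performance behind a dominant class.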
arXiv Detail & Related papers (2023-05-23T08:43:42Z)
- Considerations for Multilingual Wikipedia Research
Growing interest in non-English language editions of Wikipedia has led to the inclusion of many more language editions in datasets and models.
This paper seeks to provide some background to help researchers think about what differences might arise between different language editions of Wikipedia.
arXiv Detail & Related papers (2022-04-05T20:34:15Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
- Learning to Learn Morphological Inflection for Resource-Poor Languages
We propose to cast the task of morphological inflection - mapping a lemma to an indicated inflected form - for resource-poor languages as a meta-learning problem.
Treating each language as a separate task, we use data from high-resource source languages to learn a set of model parameters.
Experiments with two model architectures on 29 target languages from 3 families show that our suggested approach outperforms all baselines.
arXiv Detail & Related papers (2020-04-28T05:13:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.