Author Unknown: Evaluating Performance of Author Extraction Libraries on Global Online News Articles
- URL: http://arxiv.org/abs/2410.19771v1
- Date: Sun, 13 Oct 2024 20:19:15 GMT
- Authors: Sriharsha Hatwar, Virginia Partridge, Rahul Bhargava, Fernando Bermejo,
- Abstract summary: We present a manually coded cross-lingual dataset of authors of online news articles.
We use it to evaluate the performance of five existing software packages and one customized model.
Go-readability and Trafilatura are the most consistent solutions for author extraction, but we find all packages produce highly variable results across languages.
- Score: 41.97931444618385
- Abstract: Analysis of large corpora of online news content requires robust validation of underlying metadata extraction methodologies. Identifying the author of a given web-based news article is one example that enables various types of research questions. While numerous solutions for off-the-shelf author extraction exist, there is little work comparing performance (especially in multilingual settings). In this paper we present a manually coded cross-lingual dataset of authors of online news articles and use it to evaluate the performance of five existing software packages and one customized model. Our evaluation shows evidence for Go-readability and Trafilatura as the most consistent solutions for author extraction, but we find all packages produce highly variable results across languages. These findings are relevant for researchers wishing to utilize author data in their analysis pipelines, primarily indicating that further validation for specific languages and geographies is required to rely on results.
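The per-language comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not state which metric the paper uses, so the scoring rule here (exact match of normalized author sets) is an assumption, and the function names and sample rows are hypothetical, invented for illustration only.

```python
from collections import defaultdict
import unicodedata


def normalize(name: str) -> str:
    """Apply NFKC, case-fold, and collapse whitespace so trivial
    formatting differences are not counted as extraction misses."""
    return " ".join(unicodedata.normalize("NFKC", name).casefold().split())


def accuracy_by_language(rows):
    """rows: iterable of (language, gold_authors, extracted_authors).

    Returns {language: fraction of articles whose extracted author set
    equals the manually coded set after normalization}.
    This exact-match rule is an assumed metric, not the paper's.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lang, gold, extracted in rows:
        totals[lang] += 1
        if {normalize(a) for a in gold} == {normalize(a) for a in extracted}:
            hits[lang] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Hypothetical rows (not taken from the paper's dataset).
sample = [
    ("en", ["Jane Doe"], ["jane  doe"]),   # match after normalization
    ("en", ["John Roe"], []),              # extractor found no author
    ("es", ["Ana Pérez", "Luis Gómez"], ["Luis Gómez", "Ana Pérez"]),
]
scores = accuracy_by_language(sample)
print(scores)
```

Reporting one score per language, rather than a single pooled figure, is what would surface the kind of cross-language variability the abstract reports.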
Related papers
- Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores [12.86467344792873]
The proposed method uses soft labels derived from expert-augmented scores to fine-tune embedding models.
The paper evaluates the method using a Q&A dataset from an online shopping website and eight expert models.
(2024-08-19T01:59:25Z)
- Leveraging Large Language Models for Mobile App Review Feature Extraction [4.879919005707447]
This study explores the hypothesis that encoder-only large language models can enhance feature extraction from mobile app reviews.
By leveraging crowdsourced annotations from an industrial context, we redefine feature extraction as a supervised token classification task.
Empirical evaluations demonstrate that this method improves the precision and recall of extracted features and enhances performance efficiency.
(2024-08-02T07:31:57Z)
- POLygraph: Polish Fake News Dataset [0.37698262166557467]
This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish.
The dataset is composed of two parts: the "fake-or-not" dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the "fake-they-say" dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them.
The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity.
(2024-07-01T15:45:21Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
(2024-04-09T11:39:53Z)
- Leveraging Contextual Information for Effective Entity Salience Detection [21.30389576465761]
We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches.
We also show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.
(2023-09-14T19:04:40Z)
- On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
(2021-12-21T08:10:27Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
(2021-06-01T05:46:22Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
(2020-11-23T16:00:42Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
(2019-12-29T07:27:23Z)