MIND - Mainstream and Independent News Documents Corpus
- URL: http://arxiv.org/abs/2108.06249v1
- Date: Fri, 13 Aug 2021 14:00:12 GMT
- Title: MIND - Mainstream and Independent News Documents Corpus
- Authors: Danielle Caled, Paula Carvalho, Mário J. Silva
- Abstract summary: This paper characterizes MIND, a new Portuguese corpus comprised of different types of articles collected from online mainstream and alternative media sources.
The articles in the corpus are organized into five collections: facts, opinions, entertainment, satires, and conspiracy theories.
- Score: 0.7347989843033033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents and characterizes MIND, a new Portuguese corpus comprised
of different types of articles collected from online mainstream and alternative
media sources, over a 10-month period. The articles in the corpus are organized
into five collections: facts, opinions, entertainment, satires, and conspiracy
theories. Throughout this paper, we explain how the data collection process was
conducted, and present a set of linguistic metrics that allow us to perform a
preliminary characterization of the texts included in the corpus. Also, we
deliver an analysis of the most frequent topics in the corpus, and discuss the
main differences and similarities among the collections considered. Finally, we
enumerate some tasks and applications that could benefit from this corpus, in
particular the ones (in)directly related to misinformation detection. Overall,
our contribution of a corpus and initial analysis is designed to support
future exploratory news studies, and provide a better insight into
misinformation.
Related papers
- Interactive Topic Models with Optimal Transport [75.26555710661908]
We present EdTM, as an approach for label name supervised topic modeling.
EdTM models topic modeling as an assignment problem while leveraging LM/LLM based document-topic affinities.
arXiv Detail & Related papers (2024-06-28T13:57:27Z)
- Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z)
- Code Book for the Annotation of Diverse Cross-Document Coreference of Entities in News Articles [0.0]
It includes a precise description of how to set up Inception, a respective annotation tool, how to annotate entities in news articles, connect them with diverse coreferential relations, and link them across documents to Wikidata's global knowledge graph.
Our main contribution lies in providing a methodology for creating a diverse cross-document coreference corpus which can be applied to the analysis of media bias by word-choice and labelling.
arXiv Detail & Related papers (2023-10-18T15:53:45Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Towards Corpus-Scale Discovery of Selection Biases in News Coverage: Comparing What Sources Say About Entities as a Start [65.28355014154549]
This paper investigates the challenges of building scalable NLP systems for discovering patterns of media selection biases directly from news content in massive-scale news corpora.
We show the capabilities of the framework through a case study on NELA-2020, a corpus of 1.8M news articles in English from 519 news sources worldwide.
arXiv Detail & Related papers (2023-04-06T23:36:45Z)
- Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review [52.359007622096684]
Peer review is a key component of the publishing process in most fields of science.
Existing NLP studies focus on the analysis of individual texts.
Editorial assistance often requires modeling interactions between pairs of texts.
arXiv Detail & Related papers (2022-04-22T16:39:38Z)
- Models and Datasets for Cross-Lingual Summarisation [78.56238251185214]
We present a cross-lingual summarisation corpus with long documents in a source language associated with multi-sentence summaries in a target language.
The corpus covers twelve language pairs and directions for four European languages, namely Czech, English, French and German.
We derive cross-lingual document-summary instances from Wikipedia by combining lead paragraphs and articles' bodies from language aligned Wikipedia titles.
arXiv Detail & Related papers (2022-02-19T11:55:40Z)
- Cross-context News Corpus for Protest Events related Knowledge Base Construction [0.15393457051344295]
We describe a gold standard corpus of protest events drawn from various local and international sources in English.
This corpus facilitates creating machine learning models that automatically classify news articles and extract protest event-related information.
arXiv Detail & Related papers (2020-08-01T22:20:48Z)
- Quantum Criticism: A Tagged News Corpus Analysed for Sentiment and Named Entities [18.458831729497224]
We continuously collect data from the RSS feeds of traditional news sources.
We perform sentiment analysis of each news article at the document, paragraph and sentence level.
We show how the data in this corpus could be used to identify bias in news reporting.
arXiv Detail & Related papers (2020-06-05T17:59:12Z)
- The Discussion Tracker Corpus of Collaborative Argumentation [2.800857580710507]
The Discussion Tracker corpus was collected in American high school English classes.
The corpus consists of 29 multi-party discussions of English literature transcribed from 985 minutes of audio.
arXiv Detail & Related papers (2020-05-22T18:27:28Z)
- Know thy corpus! Robust methods for digital curation of Web corpora [0.0]
This paper proposes a novel framework for digital curation of Web corpora.
It provides robust estimation of their parameters, such as their composition and the lexicon.
arXiv Detail & Related papers (2020-03-13T17:21:57Z)