Know thy corpus! Robust methods for digital curation of Web corpora
- URL: http://arxiv.org/abs/2003.06389v1
- Date: Fri, 13 Mar 2020 17:21:57 GMT
- Title: Know thy corpus! Robust methods for digital curation of Web corpora
- Authors: Serge Sharoff
- Abstract summary: This paper proposes a novel framework for digital curation of Web corpora.
It provides robust estimation of their parameters, such as their composition and the lexicon.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a novel framework for digital curation of Web corpora in
order to provide robust estimation of their parameters, such as their
composition and the lexicon. In recent years, language models pre-trained on
large corpora have emerged as clear winners in numerous NLP tasks, but no proper
analysis of the corpora which led to their success has been conducted. The
paper presents a procedure for robust frequency estimation, which helps in
establishing the core lexicon for a given corpus, as well as a procedure for
estimating the corpus composition via unsupervised topic models and via
supervised genre classification of Web pages. The results of the digital
curation study applied to several Web-derived corpora demonstrate their
considerable differences. First, this concerns different frequency bursts which
impact the core lexicon obtained from each corpus. Second, this concerns the
kinds of texts they contain. For example, OpenWebText contains considerably
more topical news and political argumentation in comparison to ukWac or
Wikipedia. The tools and the results of analysis have been released.
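The abstract does not spell out the robust frequency estimation procedure, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes one simple burst-damping technique: each document may contribute at most `cap` occurrences of a word to the corpus count. The function name `raw_and_capped_counts`, the `cap` parameter, and the toy documents are all hypothetical.

```python
from collections import Counter


def raw_and_capped_counts(docs, cap=3):
    """Return (raw, capped) word counts for a list of tokenized documents.

    `cap` is a hypothetical per-document ceiling: a word contributes at most
    `cap` occurrences from any single document, which limits the effect of
    a frequency burst confined to a few documents.
    """
    raw, capped = Counter(), Counter()
    for tokens in docs:
        per_doc = Counter(tokens)
        for word, n in per_doc.items():
            raw[word] += n                 # plain corpus frequency
            capped[word] += min(n, cap)    # burst-limited frequency
    return raw, capped


# Toy corpus: the third document repeats one word 50 times (a "burst").
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    ("pokemon " * 50 + "the end").split(),
]
raw, capped = raw_and_capped_counts(docs, cap=3)
print(raw.most_common(2))
print(capped.most_common(2))
```

In this toy corpus the bursty word dominates the raw counts but falls below the common word "the" once per-document contributions are capped, which is the kind of effect on a core lexicon that the abstract attributes to frequency bursts.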
Related papers
- New Textual Corpora for Serbian Language Modeling [0.0]
The uniqueness of both old and new corpora will be assessed via frequency-based stylometric methods.
The paper will introduce three new corpora: a new umbrella web corpus of Serbo-Croatian, a new high-quality corpus based on the doctoral dissertations stored in the National Repository of Doctoral Dissertations from all universities in Serbia, and a parallel corpus of abstract translations from the same source.
arXiv Detail & Related papers (2024-05-15T11:05:16Z)
- What's In My Big Data? [67.04525616289949]
We propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora.
WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node.
Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content.
arXiv Detail & Related papers (2023-10-31T17:59:38Z)
- MIND - Mainstream and Independent News Documents Corpus [0.7347989843033033]
This paper characterizes MIND, a new Portuguese corpus comprised of different types of articles collected from online mainstream and alternative media sources.
The articles in the corpus are organized into five collections: facts, opinions, entertainment, satires, and conspiracy theories.
arXiv Detail & Related papers (2021-08-13T14:00:12Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses how well existing language models distinguish the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- What's in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus [77.34726150561087]
We analyze the Common Crawl, a colossal web corpus extensively used for training language models.
We find that it contains a significant amount of undesirable content, including hate speech and sexually explicit content, even after filtering procedures.
arXiv Detail & Related papers (2021-05-06T14:49:43Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
- Hierarchical Bi-Directional Self-Attention Networks for Paper Review Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: sentence encoder (level one), intra-review encoder (level two) and inter-review encoder (level three).
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
arXiv Detail & Related papers (2020-11-02T08:07:50Z)
- Graph-based Topic Extraction from Vector Embeddings of Text Documents: Application to a Corpus of News Articles [0.0]
We present an unsupervised framework that brings together powerful vector embeddings from natural language processing with tools from multiscale graph partitioning.
We show the advantages of graph-based clustering through end-to-end comparisons with other popular clustering and topic modelling methods.
This work is showcased through an analysis of a corpus of US news coverage during the presidential election year of 2016.
arXiv Detail & Related papers (2020-10-28T16:20:05Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- The Discussion Tracker Corpus of Collaborative Argumentation [2.800857580710507]
The Discussion Tracker corpus was collected in American high school English classes.
The corpus consists of 29 multi-party discussions of English literature transcribed from 985 minutes of audio.
arXiv Detail & Related papers (2020-05-22T18:27:28Z)