AMALGUM -- A Free, Balanced, Multilayer English Web Corpus
- URL: http://arxiv.org/abs/2006.10677v1
- Date: Thu, 18 Jun 2020 17:05:45 GMT
- Title: AMALGUM -- A Free, Balanced, Multilayer English Web Corpus
- Authors: Luke Gessler, Siyao Peng, Yang Liu, Yilun Zhu, Shabnam Behzad, Amir Zeldes
- Abstract summary: We present a genre-balanced English web corpus totaling 4M tokens.
By tapping open online data sources, the corpus is meant to offer a more sizable alternative to smaller manually created annotated data sets.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present a freely available, genre-balanced English web corpus totaling 4M
tokens and featuring a large number of high-quality automatic annotation
layers, including dependency trees, non-named entity annotations, coreference
resolution, and discourse trees in Rhetorical Structure Theory. By tapping open
online data sources, the corpus is meant to offer a more sizable alternative to
smaller manually created annotated data sets, while avoiding pitfalls such as
imbalanced or unknown composition, licensing problems, and low-quality natural
language processing. We harness knowledge from multiple annotation layers in
order to achieve a "better than NLP" benchmark and evaluate the accuracy of the
resulting resource.
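The dependency-tree layer described above is conventionally exchanged in CoNLL-U format. As a rough illustration of how such an annotation layer can be consumed, the sketch below parses one CoNLL-U sentence block; the sample sentence and the `parse_conllu` helper are invented for illustration and are not part of the AMALGUM tooling.

```python
# Minimal sketch: reading one dependency-annotated sentence in CoNLL-U,
# the usual interchange format for dependency trees. Sample data is invented.

def parse_conllu(block: str):
    """Parse a single CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip comment lines and blank separators
        cols = line.split("\t")  # 10 tab-separated CoNLL-U columns
        tokens.append({
            "id": int(cols[0]),    # token index (1-based)
            "form": cols[1],       # surface form
            "upos": cols[3],       # universal POS tag
            "head": int(cols[6]),  # index of syntactic head (0 = root)
            "deprel": cols[7],     # dependency relation label
        })
    return tokens

sample = """# text = Corpora help research .
1\tCorpora\tcorpus\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\thelp\thelp\tVERB\t_\t_\t0\troot\t_\t_
3\tresearch\tresearch\tNOUN\t_\t_\t2\tobj\t_\t_
4\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""

tokens = parse_conllu(sample)
root = next(t for t in tokens if t["head"] == 0)
print(root["form"])  # prints the sentence's syntactic root: help
```

Other layers (entities, coreference, RST trees) use their own formats, but the same pattern of line-oriented, column-based parsing applies to most of them.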
Related papers
- FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis [0.0]
The Algerian dialect (AD) faces challenges due to the absence of annotated corpora.
This study outlines the development process of a specialized corpus for Fake News (FN) detection and sentiment analysis (SA) in AD called FASSILA.
arXiv Detail & Related papers (2024-11-07T10:39:10Z)
- DecorateLM: Data Engineering through Corpus Rating, Tagging, and Editing with Language Models [78.51470038301436]
We introduce DecorateLM, a data engineering method designed to refine the pretraining corpus through data rating, tagging and editing.
We then apply DecorateLM to enhance 100 billion tokens of the training corpus, selecting 45 billion tokens that exemplify high quality and diversity for the further training of another 1.2 billion parameter LLM.
Our results demonstrate that employing such high-quality data can significantly boost model performance, showcasing a powerful approach to enhance the quality of the pretraining corpus.
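The rating-then-selecting step can be illustrated with a toy sketch. This is not DecorateLM's actual algorithm; the greedy budget-based selection, the scores, and the document names are all invented for illustration.

```python
# Hypothetical sketch of score-based corpus selection: keep the
# highest-rated documents until a token budget is exhausted.
# Scores, budget, and document IDs are invented for illustration.

def select_by_rating(docs, token_budget):
    """docs: list of (doc_id, n_tokens, quality_score) tuples."""
    kept, used = [], 0
    # Greedily take documents in descending order of quality score.
    for doc_id, n_tokens, score in sorted(docs, key=lambda d: -d[2]):
        if used + n_tokens <= token_budget:
            kept.append(doc_id)
            used += n_tokens
    return kept, used

docs = [("a", 400, 0.9), ("b", 700, 0.4), ("c", 300, 0.8), ("d", 200, 0.7)]
kept, used = select_by_rating(docs, token_budget=900)
print(kept, used)  # ['a', 'c', 'd'] 900
```

At the scale described in the abstract (selecting 45B of 100B tokens), such selection would run over per-document scores produced by the rating models rather than a hand-written list.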
arXiv Detail & Related papers (2024-09-30T02:56:35Z)
- UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs [19.097842830790405]
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios and focus on narrowly defined dimensions.
We create UniSumEval benchmark, which extends the range of input context and provides fine-grained, multi-dimensional annotations.
arXiv Detail & Related papers (2023-11-11T00:11:50Z)
- Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text [9.484323358958706]
We create and release BREAD, a human-labeled benchmark on repetitive boilerplate vs. plausible linguistic content.
We release several baseline CRED (Character REDundancy) scores along with it, and evaluate their effectiveness on BREAD.
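A character-redundancy signal of this general kind can be sketched very simply. The function below is a naive duplicate-n-gram ratio, not the paper's CRED scores; the example strings are invented.

```python
# Naive character-redundancy score (NOT the paper's CRED metrics):
# the fraction of character 5-grams that are duplicates. Repetitive
# boilerplate repeats many n-grams, so higher values suggest redundancy.

def char_redundancy(text: str, n: int = 5) -> float:
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

boiler = "click here click here click here click here"
prose = "corpora built from diverse web genres vary widely"
print(char_redundancy(boiler) > char_redundancy(prose))  # True
```

A benchmark like BREAD is needed precisely because such cheap heuristics must be validated against human judgments of what counts as boilerplate versus plausible repetitive content.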
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from rich-sourced languages to poorer ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- Named Entity Recognition and Linking Augmented with Large-Scale Structured Data [3.211619859724085]
We describe our submissions to the 2nd and 3rd SlavNER Shared Tasks held at BSNLP 2019 and BSNLP 2021.
The tasks focused on the analysis of Named Entities in multilingual Web documents in Slavic languages with rich inflection.
Our solution takes advantage of large collections of both unstructured and structured documents.
arXiv Detail & Related papers (2021-04-27T20:10:18Z)
- Exhaustive Entity Recognition for Coptic: Challenges and Solutions [8.980876474818153]
In this paper we present entity recognition for Coptic, the language of Hellenistic-era Egypt.
We evaluate NLP approaches to the task and lay out difficulties in applying them to a low-resource, morphologically complex language.
We present solutions for named and non-named nested entity recognition and semi-automatic entity linking to Wikipedia, relying on robust dependency parsing, feature-based CRF models, and hand-crafted knowledge base resources.
arXiv Detail & Related papers (2020-11-03T23:49:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.