CREER: A Large-Scale Corpus for Relation Extraction and Entity Recognition
- URL: http://arxiv.org/abs/2204.12710v1
- Date: Wed, 27 Apr 2022 05:43:21 GMT
- Title: CREER: A Large-Scale Corpus for Relation Extraction and Entity Recognition
- Authors: Yu-Siou Tang and Chung-Hsien Wu
- Abstract summary: The CREER dataset uses the Stanford CoreNLP Annotator to capture rich language structures from Wikipedia plain text.
This dataset follows widely used linguistic and semantic annotation schemes, so it can not only support most natural language processing tasks but also be scaled up further.
- Score: 9.54366784050374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe the design and use of the CREER dataset, a large corpus annotated
with rich English grammar and semantic attributes. The CREER dataset uses the
Stanford CoreNLP Annotator to capture rich language structures from Wikipedia
plain text. This dataset follows widely used linguistic and semantic annotation
schemes, so it can not only support most natural language processing tasks but
also be scaled up further. This large supervised dataset
can serve as the basis for improving the performance of NLP tasks in the
future.
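To make the annotation pipeline above concrete, here is a minimal sketch (not
taken from the paper) of how plain Wikipedia text could be run through the
Stanford CoreNLP Annotator via the official stanza Python client; the annotator
set, server settings, and example sentence are illustrative assumptions, and the
exact configuration used to build CREER may differ.

    # Minimal sketch: annotate plain text with Stanford CoreNLP through the
    # `stanza` client. Requires a local CoreNLP installation (CORENLP_HOME set)
    # and `pip install stanza`. The annotator list is an assumption and may not
    # match the layers actually used to build CREER.
    from stanza.server import CoreNLPClient

    text = "Alan Turing was born in London and studied at Cambridge."  # example input

    with CoreNLPClient(
            annotators=["tokenize", "ssplit", "pos", "lemma", "ner", "depparse"],
            timeout=30000,
            memory="4G") as client:
        doc = client.annotate(text)
        for sentence in doc.sentence:
            for token in sentence.token:
                # Surface form, part-of-speech tag, and named-entity label,
                # i.e. the grammatical and entity attributes the abstract refers to.
                print(token.word, token.pos, token.ner)

The depparse annotator supplies the grammatical structure and ner the entity
layer mentioned in the abstract; further annotators (e.g. coref) could be added
to the list in the same way if richer semantic attributes are needed.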
Related papers
- Unlocking Korean Verbs: A User-Friendly Exploration into the Verb Lexicon [5.358486800301437]
The Sejong dictionary dataset offers extensive coverage of morphology, syntax, and semantic representation.
The labeled linguistic structures within this dataset form the basis for uncovering relationships between words and phrases.
This paper introduces a user-friendly web interface designed for the collection and consolidation of verb-related information.
arXiv Detail & Related papers (2024-10-01T22:03:34Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results show that NLI fine-tuning increases the performance of the models on both tasks and in both languages, and has the potential to improve both mono- and multilingual models.
arXiv Detail & Related papers (2023-08-06T12:40:58Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Pre-Training to Learn in Context [138.0745138788142]
The in-context learning ability of language models is not fully exploited because they are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- What's in a Name? Evaluating Assembly-Part Semantic Knowledge in Language Models through User-Provided Names in CAD Files [4.387757291346397]
We propose that the natural language names designers use in Computer Aided Design (CAD) software are a valuable source of assembly-part semantic knowledge.
In particular, we extract and clean a large corpus of natural language part, feature, and document names.
We show that fine-tuning on the text data corpus further boosts the performance on all tasks, thus demonstrating the value of the text data.
arXiv Detail & Related papers (2023-04-25T12:30:01Z)
- Entity Aware Syntax Tree Based Data Augmentation for Natural Language Understanding [5.02493891738617]
We propose a novel NLP data augmentation technique that applies a tree structure, the Entity Aware Syntax Tree (EAST), to represent sentences combined with attention on the entity.
Our EADA technique automatically constructs an EAST from a small amount of annotated data, and then generates a large number of training instances for intent detection and slot filling.
Experimental results on four datasets showed that the proposed technique significantly outperforms the existing data augmentation methods in terms of both accuracy and generalization ability.
arXiv Detail & Related papers (2022-09-06T07:34:10Z)
- Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP [0.5482532589225552]
In our work, we suggest leveraging pretrained language models for training data acquisition.
We create a custom dataset, which we use to train GPTNERMED, a medical NER model for German texts.
arXiv Detail & Related papers (2022-08-30T18:42:55Z)
- SCROLLS: Standardized CompaRison Over Long Language Sequences [62.574959194373264]
We introduce SCROLLS, a suite of tasks that require reasoning over long texts.
SCROLLS contains summarization, question answering, and natural language inference tasks.
We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
arXiv Detail & Related papers (2022-01-10T18:47:15Z)
- An Exploratory Study on Utilising the Web of Linked Data for Product Data Mining [3.7376948366228175]
This work focuses on the e-commerce domain to explore methods of utilising structured data to create language resources that may be used for product classification and linking.
We process billions of structured data points in the form of RDF n-quads to create multi-million-word product-related corpora, which are later used in three different ways to create language resources.
Our evaluation on an extensive set of benchmarks shows word embeddings to be the most reliable and consistent method to improve the accuracy on both tasks.
arXiv Detail & Related papers (2021-09-03T09:58:36Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.