Provenance for Linguistic Corpora Through Nanopublications
- URL: http://arxiv.org/abs/2006.06341v2
- Date: Mon, 2 Nov 2020 07:29:09 GMT
- Title: Provenance for Linguistic Corpora Through Nanopublications
- Authors: Timo Lek, Anna de Groot, Tobias Kuhn, Roser Morante
- Abstract summary: Research in Computational Linguistics is dependent on text corpora for training and testing new tools and methodologies.
While there exists a plethora of annotated linguistic information, these corpora are often not interoperable without significant manual work.
This paper addresses this issue with a case study on event annotated corpora and by creating a new, more interoperable representation of this data in the form of nanopublications.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research in Computational Linguistics is dependent on text corpora for
training and testing new tools and methodologies. While there exists a plethora
of annotated linguistic information, these corpora are often not interoperable
without significant manual work. Moreover, these annotations might have evolved
into different versions, making it challenging for researchers to know the
data's provenance. This paper addresses this issue with a case study on event
annotated corpora and by creating a new, more interoperable representation of
this data in the form of nanopublications. We demonstrate how linguistic
annotations from separate corpora can be reliably linked from the start, and
thereby be accessed and queried as if they were a single dataset. We describe
how such nanopublications can be created and demonstrate how SPARQL queries can
be performed to extract interesting content from the new representations. The
queries show that information of multiple corpora can be retrieved more easily
and effectively because the information of different corpora is represented in
a uniform data format.
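The core benefit described above is that annotations from separate corpora, once expressed in one uniform triple-based format (as nanopublications are), can be queried as a single dataset. A minimal sketch of that idea, using hypothetical data and illustrative predicate names (`hasEventType`, `fromCorpus` are not the paper's actual schema), with plain Python tuples standing in for RDF triples:

```python
# Event annotations from two hypothetical corpora, both expressed as
# uniform (subject, predicate, object) triples -- the same shape a
# nanopublication's assertion graph would use.
corpus_a = [
    ("doc1#sent3", "hasEventType", "Conflict"),
    ("doc1#sent3", "fromCorpus", "CorpusA"),
]
corpus_b = [
    ("doc9#sent1", "hasEventType", "Conflict"),
    ("doc9#sent1", "fromCorpus", "CorpusB"),
]

# Because both corpora share one data format, merging them is trivial.
store = corpus_a + corpus_b

def query(triples, predicate, obj):
    """Return all subjects matching a predicate/object pair,
    analogous to a one-pattern SPARQL SELECT."""
    return [s for s, p, o in triples if p == predicate and o == obj]

# A single query now retrieves matching annotations from both corpora.
print(query(store, "hasEventType", "Conflict"))  # ['doc1#sent3', 'doc9#sent1']
```

In practice the paper uses SPARQL over actual nanopublications rather than in-memory tuples, but the principle is the same: a uniform representation makes cross-corpus retrieval a single query instead of per-corpus manual work.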
Related papers
- Synthetic continued pretraining [29.6872772403251]
We propose synthetic continued pretraining on a small corpus of domain-specific documents.
We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm.
We show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
arXiv Detail & Related papers (2024-09-11T17:21:59Z)
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z) - GPT Struct Me: Probing GPT Models on Narrative Entity Extraction [2.049592435988883]
We evaluate the capabilities of two state-of-the-art language models -- GPT-3 and GPT-3.5 -- in the extraction of narrative entities.
This study is conducted on the Text2Story Lusa dataset, a collection of 119 Portuguese news articles.
arXiv Detail & Related papers (2023-11-24T16:19:04Z) - Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics
Interface of LMs Through Agentivity [68.8204255655161]
We present the semantic notion of agentivity as a case study for probing such interactions.
This suggests LMs may serve as useful tools for linguistic annotation, theory testing, and discovery.
arXiv Detail & Related papers (2023-05-29T16:24:01Z) - Prompting Language Models for Linguistic Structure [73.11488464916668]
We present a structured prompting approach for linguistic structured prediction tasks.
We evaluate this approach on part-of-speech tagging, named entity recognition, and sentence chunking.
We find that while PLMs contain significant prior knowledge of task labels due to task leakage into the pretraining corpus, structured prompting can also retrieve linguistic structure with arbitrary labels.
arXiv Detail & Related papers (2022-11-15T01:13:39Z) - Explaining Patterns in Data with Language Models via Interpretable
Autoprompting [143.4162028260874]
We introduce interpretable autoprompting (iPrompt), an algorithm that generates a natural-language string explaining the data.
iPrompt can yield meaningful insights by accurately finding ground-truth dataset descriptions.
Experiments with an fMRI dataset show the potential for iPrompt to aid in scientific discovery.
arXiv Detail & Related papers (2022-10-04T18:32:14Z) - Combining pre-trained language models and structured knowledge [9.521634184008574]
Transformer-based language models have achieved state-of-the-art performance on various NLP benchmarks.
It has proven challenging to integrate structured information, such as knowledge graphs into these models.
We examine a variety of approaches to integrate structured knowledge into current language models and determine challenges, and possible opportunities to leverage both structured and unstructured information sources.
arXiv Detail & Related papers (2021-01-28T21:54:03Z) - Intrinsic Probing through Dimension Selection [69.52439198455438]
Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks.
Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it.
In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted.
arXiv Detail & Related papers (2020-10-06T15:21:08Z) - Observations on Annotations [0.5175994976508882]
The paper approaches the topic of annotation from several angles, including Hypertext, Computational Linguistics and Language Technology, Artificial Intelligence, and Open Science.
In terms of complexity, they can range from trivial to highly sophisticated, in terms of maturity from experimental to standardised.
Primary research data, such as text documents, can be annotated on different layers concurrently; the layers are independent but can be exploited through multi-layer querying.
arXiv Detail & Related papers (2020-04-21T20:29:50Z) - LowResourceEval-2019: a shared task on morphological analysis for
low-resource languages [0.30998852056211795]
The paper describes the results of the first shared task on morphological analysis for the languages of Russia, namely, Evenki, Karelian, Selkup, and Veps.
The tasks include morphological analysis, word form generation and morpheme segmentation.
The article describes the datasets prepared for the shared tasks and contains analysis of the participants' solutions.
arXiv Detail & Related papers (2020-01-30T12:47:50Z) - ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine
Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.