Parsing Early Modern English for Linguistic Search
- URL: http://arxiv.org/abs/2002.10546v1
- Date: Mon, 24 Feb 2020 21:04:51 GMT
- Title: Parsing Early Modern English for Linguistic Search
- Authors: Seth Kulick and Neville Ryant
- Abstract summary: We investigate whether advances in NLP make it possible to vastly increase the size of data usable for research in historical syntax.
This brings together many of the usual tools in NLP - word embeddings, tagging, and parsing - in the service of linguistic queries over automatically annotated corpora.
We train a part-of-speech (POS) tagger and parser on a corpus of historical English, using ELMo embeddings trained over a billion words of similar text.
- Score: 3.927039542429003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the question of whether advances in NLP over the last few
years make it possible to vastly increase the size of data usable for research
in historical syntax. This brings together many of the usual tools in NLP -
word embeddings, tagging, and parsing - in the service of linguistic queries
over automatically annotated corpora. We train a part-of-speech (POS) tagger
and parser on a corpus of historical English, using ELMo embeddings trained
over a billion words of similar text. The evaluation is based on the standard
metrics, as well as on the accuracy of the query searches using the parsed
data.
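A minimal sketch of the kind of pipeline the abstract describes: contextual ELMo embeddings feeding a small BiLSTM POS tagger. The weight/option file paths, tagset size, and layer sizes below are hypothetical placeholders, not the authors' actual configuration.
```python
import torch
import torch.nn as nn
from allennlp.modules.elmo import Elmo, batch_to_ids

OPTIONS_FILE = "elmo_historical_options.json"  # hypothetical path to ELMo config
WEIGHT_FILE = "elmo_historical_weights.hdf5"   # hypothetical path to ELMo weights
NUM_TAGS = 50                                  # placeholder POS tagset size


class ElmoTagger(nn.Module):
    """BiLSTM tagger over frozen ELMo representations."""

    def __init__(self, num_tags: int):
        super().__init__()
        self.elmo = Elmo(OPTIONS_FILE, WEIGHT_FILE,
                         num_output_representations=1, dropout=0.0)
        self.lstm = nn.LSTM(1024, 256, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, num_tags)

    def forward(self, sentences):
        char_ids = batch_to_ids(sentences)                      # (batch, seq, 50) char ids
        reps = self.elmo(char_ids)["elmo_representations"][0]   # (batch, seq, 1024) for standard ELMo
        hidden, _ = self.lstm(reps)
        return self.proj(hidden)                                # per-token tag logits


tagger = ElmoTagger(NUM_TAGS)
logits = tagger([["What", "meanes", "this", "shouting", "?"]])
predicted = logits.argmax(dim=-1)                               # predicted tag ids per token
```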
Related papers
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
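As a rough illustration of the in-batch contrastive (InfoNCE) objective a retriever trained "using contrastive learning" might use; this is not the authors' CCPR implementation, and the temperature and batch construction are assumptions.
```python
import torch
import torch.nn.functional as F


def info_nce_loss(query_emb: torch.Tensor,
                  phrase_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """Each query's positive is the phrase at the same batch index;
    every other phrase in the batch serves as a negative."""
    query_emb = F.normalize(query_emb, dim=-1)
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    logits = query_emb @ phrase_emb.T / temperature      # (batch, batch) similarities
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)


# Toy usage: 8 cross-lingual phrase pairs with 768-dimensional encoder outputs.
queries = torch.randn(8, 768, requires_grad=True)
phrases = torch.randn(8, 768, requires_grad=True)
info_nce_loss(queries, phrases).backward()
```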
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - Syntactic Language Change in English and German: Metrics, Parsers, and Convergences [56.47832275431858]
The current paper looks at diachronic trends in syntactic language change in both English and German, using corpora of parliamentary debates from the last c. 160 years.
We base our observations on five dependency parsers, including the widely used Stanford CoreNLP as well as 4 newer alternatives.
We show that changes in syntactic measures seem to be more frequent at the tails of sentence length distributions.
arXiv Detail & Related papers (2024-02-18T11:46:16Z) - Dense X Retrieval: What Retrieval Granularity Should We Use? [56.90827473115201]
An often-overlooked design choice is the retrieval unit in which the corpus is indexed, e.g., document, passage, or sentence.
We introduce a novel retrieval unit, proposition, for dense retrieval.
Experiments reveal that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
arXiv Detail & Related papers (2023-12-11T18:57:35Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
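The paper finetunes a model to produce the LSS; as a purely lexical stand-in for the underlying idea (not the learned metric itself), the longest in-order sequence of claim tokens that also occurs in the context is a standard longest-common-subsequence computation:
```python
# Lexical approximation only: the longest (possibly noncontiguous) sequence of claim
# tokens appearing in the same order in the context, via classic LCS dynamic programming.
def longest_supported_subsequence(claim: str, context: str) -> list[str]:
    a, b = claim.split(), context.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]   # dp[i][j] = LCS length of a[:i], b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1] else max(dp[i - 1][j], dp[i][j - 1])
    out, i, j = [], len(a), len(b)                         # trace back to recover the tokens
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


claim = "the tagger was trained on a billion words"
context = "the POS tagger and parser were trained using embeddings built over a billion words"
print(longest_supported_subsequence(claim, context))
# -> ['the', 'tagger', 'trained', 'a', 'billion', 'words']
```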
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - Semantic Parsing for Conversational Question Answering over Knowledge Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to execution results thereof.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
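A toy version of the sentences-as-units idea (not the SMART metric itself; the string-overlap matcher and the F1 aggregation below are simplifications of what the paper does with model-based matching functions):
```python
# Toy sentence-level soft matching; SMART itself supports model-based matchers.
from difflib import SequenceMatcher


def sent_sim(a: str, b: str) -> float:
    """Stand-in similarity between two sentences (token-overlap ratio)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def soft_sentence_f1(candidate: list[str], reference: list[str]) -> float:
    """Soft precision/recall: match each sentence to its best counterpart."""
    precision = sum(max(sent_sim(c, r) for r in reference) for c in candidate) / len(candidate)
    recall = sum(max(sent_sim(r, c) for c in candidate) for r in reference) / len(reference)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


candidate = ["A tagger was trained on historical English.",
             "Evaluation used query accuracy."]
reference = ["The authors train a POS tagger and parser on historical English.",
             "They evaluate with standard metrics and with query accuracy."]
print(round(soft_sentence_f1(candidate, reference), 3))
```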
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - Query-Based Keyphrase Extraction from Long Documents [4.823229052465654]
This paper overcomes the input-length issue for keyphrase extraction by chunking long documents.
The system employs a pre-trained BERT model and adapts it to estimate the probability that a given text span forms a keyphrase.
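A sketch of the chunking step such a system needs before a length-limited encoder like BERT can score candidate spans (the window and stride sizes are illustrative, not the paper's settings):
```python
# Split a long token sequence into overlapping windows so no chunk exceeds the
# encoder's input limit; overlap keeps spans from being cut at chunk boundaries.
def chunk_tokens(tokens: list[str], window: int = 512, stride: int = 384):
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append((start, tokens[start:start + window]))
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


document = ["token"] * 1200                     # stand-in for a long tokenized document
for offset, chunk in chunk_tokens(document):
    print(offset, len(chunk))                   # 0 512 / 384 512 / 768 432
```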
arXiv Detail & Related papers (2022-05-11T10:29:30Z) - Automatically Ranked Russian Paraphrase Corpus for Text Generation [0.0]
The article is focused on automatic development and ranking of a large corpus for Russian paraphrase generation.
Existing manually annotated paraphrase datasets for Russian are limited to the small-sized ParaPhraser corpus and ParaPlag.
arXiv Detail & Related papers (2020-06-17T08:40:52Z) - A Methodology for Creating Question Answering Corpora Using Inverse Data Annotation [16.914116942666976]
We introduce a novel methodology to efficiently construct a corpus for question answering over structured data.
In our method, we randomly generate operation trees (OTs) from a context-free grammar.
We apply the method to create a new corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus.
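A minimal sketch of what randomly generating trees from a context-free grammar looks like; the toy grammar below is an assumption for illustration, not the OTTA grammar:
```python
import random

# Toy grammar over table operations; nonterminals map to lists of productions.
GRAMMAR = {
    "OT": [["FILTER", "OT"], ["AGGREGATE", "OT"], ["TABLE"]],
    "FILTER": [["filter(column, value)"]],
    "AGGREGATE": [["count"], ["max"], ["min"]],
    "TABLE": [["scan(table)"]],
}


def generate(symbol: str, rng: random.Random) -> list[str]:
    """Expand a nonterminal with a randomly chosen production until only
    terminal tokens remain; returns the flattened operation sequence."""
    if symbol not in GRAMMAR:
        return [symbol]                          # terminal symbol
    out = []
    for sym in rng.choice(GRAMMAR[symbol]):
        out.extend(generate(sym, rng))
    return out


print(generate("OT", random.Random(0)))          # e.g. a short filter/aggregate chain
```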
arXiv Detail & Related papers (2020-04-16T12:50:01Z) - Is POS Tagging Necessary or Even Helpful for Neural Dependency Parsing? [22.93722845643562]
We show that POS tagging can still significantly improve parsing performance when using the Stack joint framework.
Considering that it is much cheaper to annotate POS tags than parse trees, we also investigate the utilization of large-scale heterogeneous POS tag data.
arXiv Detail & Related papers (2020-03-06T13:47:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.