Automatic Correction of Syntactic Dependency Annotation Differences
- URL: http://arxiv.org/abs/2201.05891v1
- Date: Sat, 15 Jan 2022 17:17:55 GMT
- Title: Automatic Correction of Syntactic Dependency Annotation Differences
- Authors: Andrew Zupon, Andrew Carnie, Michael Hammond, Mihai Surdeanu
- Abstract summary: We propose a method for automatically detecting annotation mismatches between dependency parsing corpora.
All three methods rely on comparing an unseen example in a new corpus with similar examples in an existing corpus.
We then evaluate these conversions by retraining two dependency parsers -- Stanza (Qi et al. 2020) and Parsing as Tagging (PaT) (Vacareanu et al. 2020) -- on the converted and unconverted data.
- Score: 17.244143187393078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Annotation inconsistencies between data sets can cause problems for
low-resource NLP, where noisy or inconsistent data cannot be as easily replaced
compared with resource-rich languages. In this paper, we propose a method for
automatically detecting annotation mismatches between dependency parsing
corpora, as well as three related methods for automatically converting the
mismatches. All three methods rely on comparing an unseen example in a new
corpus with similar examples in an existing corpus. These three methods include
a simple lexical replacement using the most frequent tag of the example in the
existing corpus, a GloVe embedding-based replacement that considers a wider
pool of examples, and a BERT embedding-based replacement that uses
contextualized embeddings to provide examples fine-tuned to our specific data.
We then evaluate these conversions by retraining two dependency parsers --
Stanza (Qi et al. 2020) and Parsing as Tagging (PaT) (Vacareanu et al. 2020) --
on the converted and unconverted data. We find that applying our conversions
yields significantly better performance in many cases. Some differences are
observed between the two parsers. Stanza has a more complex
architecture with a quadratic algorithm, so it takes longer to train, but it
can generalize better with less data. The PaT parser has a simpler architecture
with a linear algorithm, speeding up training time but requiring more training
data to reach comparable or better performance.
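To make the conversion strategies concrete, here is a minimal sketch of the first two (most-frequent-tag lexical replacement, and a GloVe-style cosine-similarity backoff). This is an illustration under assumed data structures, not the authors' released implementation; the function names (build_label_counts, lexical_replacement, glove_replacement) are hypothetical, and the BERT-based variant would follow the same retrieval pattern with contextualized token embeddings in place of static vectors.

```python
from collections import Counter, defaultdict
import numpy as np

def build_label_counts(existing_corpus):
    """Count how often each (token, dependency label) pair occurs in the
    existing (reference) corpus. `existing_corpus` is assumed to be a list
    of sentences, each a list of (token, label) pairs."""
    counts = defaultdict(Counter)
    for sentence in existing_corpus:
        for token, label in sentence:
            counts[token][label] += 1
    return counts

def lexical_replacement(token, old_label, counts):
    """Method 1: if the token was seen in the existing corpus, replace its
    label with the most frequent label it carries there."""
    if token in counts:
        return counts[token].most_common(1)[0][0]
    return old_label  # unseen token: leave the annotation unchanged

def glove_replacement(token, old_label, counts, glove):
    """Method 2: for tokens unseen in the existing corpus, back off to the
    nearest neighbor by cosine similarity of GloVe vectors (a wider pool
    of examples) and borrow that neighbor's most frequent label."""
    if token in counts:
        return lexical_replacement(token, old_label, counts)
    if token not in glove:  # no vector available either
        return old_label
    v = glove[token]
    best, best_sim = None, -1.0
    for candidate in counts:
        if candidate not in glove:
            continue
        w = glove[candidate]
        sim = float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))
        if sim > best_sim:
            best, best_sim = candidate, sim
    return counts[best].most_common(1)[0][0] if best else old_label
```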
Related papers
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z) - Few-Shot Adaptation for Parsing Contextual Utterances with LLMs [25.22099517947426]
In real-world settings, there typically exists only a limited number of contextual utterances due to annotation cost.
We examine four major paradigms for few-shot adaptation in conversational semantic parsing.
Experiments with in-context learning and fine-tuning suggest that Rewrite-then-Parse is the most promising paradigm.
arXiv Detail & Related papers (2023-09-18T21:35:19Z) - Towards Unsupervised Recognition of Token-level Semantic Differences in
Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation with gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural-network-based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z) - Efficient comparison of sentence embeddings [0.0]
We discuss various word and sentence embedding algorithms and select BERT as our sentence embedding algorithm of choice.
According to the results, FAISS performs best when used in a centralized environment with only one node, especially on big datasets (a minimal usage sketch follows this list).
arXiv Detail & Related papers (2022-04-02T09:08:34Z) - FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric [48.66580267438049]
We present FastKASSIM, a metric for utterance- and document-level syntactic similarity.
It pairs and averages the most similar dependency parse trees between a pair of documents based on tree kernels.
It runs up to 5.2 times faster than our baseline method over the documents in the r/ChangeMyView corpus.
arXiv Detail & Related papers (2022-03-15T22:33:26Z) - Comparative Study of Long Document Classification [0.0]
We revisit long document classification using standard machine learning approaches.
We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.
arXiv Detail & Related papers (2021-11-01T04:51:51Z) - On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z) - Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based
Decoding [10.002379593718471]
A successful parse transforms an input utterance into an action that is easily understood by the system.
For complex parsing tasks, the state-of-the-art method is based on autoregressive sequence-to-sequence models that generate the parse directly.
arXiv Detail & Related papers (2020-10-08T01:18:42Z) - ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification
Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z) - A Methodology for Creating Question Answering Corpora Using Inverse Data
Annotation [16.914116942666976]
We introduce a novel methodology to efficiently construct a corpus for question answering over structured data.
In our method, we randomly generate operation trees (OTs) from a context-free grammar.
We apply the method to create a new corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus.
arXiv Detail & Related papers (2020-04-16T12:50:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.