Automatic Correction of Syntactic Dependency Annotation Differences
- URL: http://arxiv.org/abs/2201.05891v1
- Date: Sat, 15 Jan 2022 17:17:55 GMT
- Title: Automatic Correction of Syntactic Dependency Annotation Differences
- Authors: Andrew Zupon, Andrew Carnie, Michael Hammond, Mihai Surdeanu
- Abstract summary: We propose a method for automatically detecting annotation mismatches between dependency parsing corpora.
All three methods rely on comparing an unseen example in a new corpus with similar examples in an existing corpus.
We then evaluate these conversions by retraining two dependency parsers -- Stanza (Qi et al. 2020) and Parsing as Tagging (PaT) (Vacareanu et al. 2020) -- on the converted and unconverted data.
- Score: 17.244143187393078
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Annotation inconsistencies between data sets can cause problems for
low-resource NLP, where noisy or inconsistent data cannot be as easily replaced
compared with resource-rich languages. In this paper, we propose a method for
automatically detecting annotation mismatches between dependency parsing
corpora, as well as three related methods for automatically converting the
mismatches. All three methods rely on comparing an unseen example in a new
corpus with similar examples in an existing corpus. These three methods include
a simple lexical replacement using the most frequent tag of the example in the
existing corpus, a GloVe embedding-based replacement that considers a wider
pool of examples, and a BERT embedding-based replacement that uses
contextualized embeddings to provide examples fine-tuned to our specific data.
We then evaluate these conversions by retraining two dependency parsers --
Stanza (Qi et al. 2020) and Parsing as Tagging (PaT) (Vacareanu et al. 2020) --
on the converted and unconverted data. We find that applying our conversions
yields significantly better performance in many cases. Some differences are
observed between the two parsers. Stanza has a more complex
architecture with a quadratic algorithm, so it takes longer to train, but it
can generalize better with less data. The PaT parser has a simpler architecture
with a linear algorithm, speeding up training time but requiring more training
data to reach comparable or better performance.
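To make the conversion strategies concrete, here is a minimal sketch of the first two (most-frequent-tag lexical replacement, and a GloVe-style cosine-similarity backoff). This is an illustration under assumed data structures, not the authors' released implementation; the function names (build_label_counts, lexical_replacement, glove_replacement) are hypothetical, and the BERT-based variant would follow the same retrieval pattern with contextualized token embeddings in place of static vectors.

```python
from collections import Counter, defaultdict
import numpy as np

def build_label_counts(existing_corpus):
    """Count how often each (token, dependency label) pair occurs in the
    existing (reference) corpus. `existing_corpus` is assumed to be a list
    of sentences, each a list of (token, label) pairs."""
    counts = defaultdict(Counter)
    for sentence in existing_corpus:
        for token, label in sentence:
            counts[token][label] += 1
    return counts

def lexical_replacement(token, old_label, counts):
    """Method 1: if the token was seen in the existing corpus, replace its
    label with the most frequent label it carries there."""
    if token in counts:
        return counts[token].most_common(1)[0][0]
    return old_label  # unseen token: leave the annotation unchanged

def glove_replacement(token, old_label, counts, glove):
    """Method 2: for tokens unseen in the existing corpus, back off to the
    nearest neighbor by cosine similarity of GloVe vectors (a wider pool
    of examples) and borrow that neighbor's most frequent label."""
    if token in counts:
        return lexical_replacement(token, old_label, counts)
    if token not in glove:  # no vector available either
        return old_label
    v = glove[token]
    best, best_sim = None, -1.0
    for candidate in counts:
        if candidate not in glove:
            continue
        w = glove[candidate]
        sim = float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))
        if sim > best_sim:
            best, best_sim = candidate, sim
    return counts[best].most_common(1)[0][0] if best else old_label
```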
Related papers
- SparseCL: Sparse Contrastive Learning for Contradiction Retrieval [87.02936971689817]
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query.
Existing methods such as similarity search and cross-encoder models exhibit significant limitations.
We introduce SparseCL, which leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences.
arXiv Detail & Related papers (2024-06-15T21:57:03Z) - Few-Shot Adaptation for Parsing Contextual Utterances with LLMs [25.22099517947426]
In real-world settings, there typically exists only a limited number of contextual utterances due to annotation cost.
We examine four major paradigms for few-shot adaptation in conversational semantic parsing.
Experiments with in-context learning and fine-tuning suggest that Rewrite-then-Parse is the most promising paradigm.
arXiv Detail & Related papers (2023-09-18T21:35:19Z) - Towards Unsupervised Recognition of Token-level Semantic Differences in
Related Documents [61.63208012250885]
We formulate recognizing semantic differences as a token-level regression task.
We study three unsupervised approaches that rely on a masked language model.
Our results show that an approach based on word alignment and sentence-level contrastive learning has a robust correlation with gold labels.
arXiv Detail & Related papers (2023-05-22T17:58:04Z) - SUN: Exploring Intrinsic Uncertainties in Text-to-SQL Parsers [61.48159785138462]
This paper aims to improve the performance of text-to-SQL parsing by exploring the intrinsic uncertainties in neural-network-based approaches (called SUN).
Extensive experiments on five benchmark datasets demonstrate that our method significantly outperforms competitors and achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-09-14T06:27:51Z) - Efficient comparison of sentence embeddings [0.0]
We discuss various word and sentence embedding algorithms and select BERT as our sentence embedding algorithm of choice.
According to the results, FAISS performs best when used in a centralized environment with only one node, especially on big datasets (a minimal usage sketch follows this list).
arXiv Detail & Related papers (2022-04-02T09:08:34Z) - FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric [48.66580267438049]
We present FastKASSIM, a metric for utterance- and document-level syntactic similarity.
It pairs and averages the most similar dependency parse trees between a pair of documents based on tree kernels.
It runs up to 5.2 times faster than our baseline method over the documents in the r/ChangeMyView corpus.
arXiv Detail & Related papers (2022-03-15T22:33:26Z) - Comparative Study of Long Document Classification [0.0]
We revisit long document classification using standard machine learning approaches.
We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.
arXiv Detail & Related papers (2021-11-01T04:51:51Z) - On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z) - Don't Parse, Insert: Multilingual Semantic Parsing with Insertion Based
Decoding [10.002379593718471]
A successful parse transforms an input utterance into an action that is easily understood by the system.
For complex parsing tasks, the state-of-the-art method is based on autoregressive sequence-to-sequence models that generate the parse directly.
arXiv Detail & Related papers (2020-10-08T01:18:42Z) - ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification
Models with Multiple Rewriting Transformations [97.27005783856285]
This paper introduces ASSET, a new dataset for assessing sentence simplification in English.
We show that simplifications in ASSET are better at capturing characteristics of simplicity when compared to other standard evaluation datasets for the task.
arXiv Detail & Related papers (2020-05-01T16:44:54Z) - A Methodology for Creating Question Answering Corpora Using Inverse Data
Annotation [16.914116942666976]
We introduce a novel methodology to efficiently construct a corpus for question answering over structured data.
In our method, we randomly generate operation trees (OTs) from a context-free grammar.
We apply the method to create a new corpus OTTA (Operation Trees and Token Assignment), a large semantic parsing corpus.
arXiv Detail & Related papers (2020-04-16T12:50:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.