Automatic Compilation of Resources for Academic Writing and Evaluating
with Informal Word Identification and Paraphrasing System
- URL: http://arxiv.org/abs/2003.02955v1
- Date: Thu, 5 Mar 2020 22:55:45 GMT
- Title: Automatic Compilation of Resources for Academic Writing and Evaluating
with Informal Word Identification and Paraphrasing System
- Authors: Seid Muhie Yimam, Gopalakrishnan Venkatesh, John Sie Yuen Lee, and Chris Biemann
- Abstract summary: We present the first approach to automatically building resources for academic writing.
The aim is to build a writing aid system that automatically edits a text so that it better adheres to the academic style of writing.
- Score: 24.42822218256954
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first approach to automatically building resources for
academic writing. The aim is to build a writing aid system that automatically
edits a text so that it better adheres to the academic style of writing. On top
of existing academic resources, such as the Corpus of Contemporary American
English (COCA) academic Word List, the New Academic Word List, and the Academic
Collocation List, we also explore how to dynamically build such resources that
would be used to automatically identify informal or non-academic words or
phrases. The resources are compiled using different generic approaches that can
be extended for different domains and languages. We describe the evaluation of
resources with a system implementation. The system consists of an informal word
identification (IWI), academic candidate paraphrase generation, and paraphrase
ranking components. To generate candidates and rank them in context, we have
used the PPDB and WordNet paraphrase resources. We use the Concepts in Context
(CoInCO) "All-Words" lexical substitution dataset both for the informal word
identification and paraphrase generation experiments. Our informal word
identification component achieves an F-1 score of 82%, significantly
outperforming a stratified classifier baseline. The main contribution of this
work is a domain-independent methodology to build targeted resources for
writing aids.
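The system described in the abstract chains three components: informal word identification (IWI), candidate paraphrase generation from resources such as PPDB and WordNet, and in-context ranking. A minimal sketch of that pipeline follows; the word list, paraphrase table, and scoring rule are toy stand-ins invented for illustration, not the paper's actual resources or ranking model.

```python
# Toy sketch of the three-stage pipeline: IWI -> candidate generation -> ranking.
# ACADEMIC_WORDS and PARAPHRASES are hypothetical stand-ins for the real
# resources (COCA academic word list, PPDB, WordNet).

# Hypothetical academic word list; words absent from it may be flagged as informal.
ACADEMIC_WORDS = {"demonstrate", "obtain", "significant", "substantial"}

# Hypothetical paraphrase table standing in for PPDB/WordNet candidates.
PARAPHRASES = {
    "show": ["demonstrate", "indicate", "exhibit"],
    "get": ["obtain", "acquire", "grab"],
    "big": ["significant", "substantial", "huge"],
}

def identify_informal(tokens):
    """IWI stage: flag tokens that are not on the academic list but
    have known paraphrase candidates."""
    return [i for i, tok in enumerate(tokens)
            if tok.lower() not in ACADEMIC_WORDS and tok.lower() in PARAPHRASES]

def rank_candidates(word):
    """Candidate generation + ranking. The paper ranks candidates in
    context; this toy score simply prefers candidates that appear on
    the academic list (sorted is stable, so ties keep table order)."""
    return sorted(PARAPHRASES[word.lower()],
                  key=lambda cand: cand in ACADEMIC_WORDS,
                  reverse=True)

def academize(sentence):
    """Replace each flagged informal word with its top-ranked paraphrase."""
    tokens = sentence.split()
    for i in identify_informal(tokens):
        tokens[i] = rank_candidates(tokens[i])[0]
    return " ".join(tokens)

print(academize("we show a big improvement"))
# -> "we demonstrate a significant improvement"
```

In the paper, ranking happens in context (using the CoInCo lexical substitution data for evaluation); the context-free score above only marks where that component would plug in.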
Related papers
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- Monolingual alignment of word senses and definitions in lexicographical resources [0.0]
The focus of this thesis is broadly on the alignment of lexicographical data, particularly dictionaries.
The first task aims to find an optimal alignment given the sense definitions of a headword in two different monolingual dictionaries.
This benchmark can be used for evaluation purposes of word-sense alignment systems.
arXiv Detail & Related papers (2022-09-06T13:09:52Z)
- The Fellowship of the Authors: Disambiguating Names from Social Network Context [2.3605348648054454]
Authority lists with extensive textual descriptions for each entity are lacking, leaving many named entities ambiguous.
We combine BERT-based mention representations with a variety of graph induction strategies and experiment with supervised and unsupervised cluster inference methods.
We find that in-domain language model pretraining can significantly improve mention representations, especially for larger corpora.
arXiv Detail & Related papers (2022-08-31T21:51:55Z)
- Taxonomy Enrichment with Text and Graph Vector Representations [61.814256012166794]
We address the problem of taxonomy enrichment which aims at adding new words to the existing taxonomy.
We present a new method that achieves strong results on this task with little effort.
We achieve state-of-the-art results across different datasets and provide an in-depth error analysis of mistakes.
arXiv Detail & Related papers (2022-01-21T09:01:12Z)
- Towards Document-Level Paraphrase Generation with Sentence Rewriting and Reordering [88.08581016329398]
We propose CoRPG (Coherence Relationship guided Paraphrase Generation) for document-level paraphrase generation.
We use graph GRU to encode the coherence relationship graph and get the coherence-aware representation for each sentence.
Our model can generate document-level paraphrases with greater diversity and semantic preservation.
arXiv Detail & Related papers (2021-09-15T05:53:40Z)
- LexSubCon: Integrating Knowledge from Lexical Resources into Contextual Embeddings for Lexical Substitution [76.615287796753]
We introduce LexSubCon, an end-to-end lexical substitution framework based on contextual embedding models.
This is achieved by combining contextual information with knowledge from structured lexical resources.
Our experiments show that LexSubCon outperforms previous state-of-the-art methods on LS07 and CoInCo benchmark datasets.
arXiv Detail & Related papers (2021-07-11T21:25:56Z)
- Dual Attention Model for Citation Recommendation [7.244791479777266]
We propose a novel embedding-based neural network called the "dual attention model for citation recommendation".
The network is designed to maximize the similarity between the embeddings of the three inputs (local context words, section, and structural contexts) and the target citation appearing in the context.
arXiv Detail & Related papers (2020-10-01T02:41:47Z)
- Techniques for Vocabulary Expansion in Hybrid Speech Recognition Systems [54.49880724137688]
The problem of out-of-vocabulary (OOV) words is typical for any speech recognition system.
One popular approach to covering OOVs is to use subword units rather than words.
In this paper, we explore existing methods for this solution at both the graph-construction and search-method levels.
arXiv Detail & Related papers (2020-03-19T21:24:45Z)
- Word Sense Disambiguation for 158 Languages using Word Embeddings Only [80.79437083582643]
Disambiguation of word senses in context is easy for humans, but a major challenge for automatic approaches.
We present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory.
We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings.
arXiv Detail & Related papers (2020-03-14T14:50:04Z)
- Detecting New Word Meanings: A Comparison of Word Embedding Models in Spanish [1.5356167668895644]
Semantic neologisms (SN) are words that acquire a new word meaning while maintaining their form.
To detect SNs semi-automatically, we developed a system that implements a combination of the following strategies.
We examine the following word embedding models: Word2Vec, Sense2Vec, and FastText.
arXiv Detail & Related papers (2020-01-12T21:54:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.