Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF
- URL: http://arxiv.org/abs/2107.05151v1
- Date: Sun, 11 Jul 2021 23:58:39 GMT
- Title: Document Embedding for Scientific Articles: Efficacy of Word Embeddings vs TFIDF
- Authors: H.J. Meijer, J. Truong, R. Karimi
- Abstract summary: This research focuses on the performance of word embeddings applied to a large-scale academic corpus.
We compare the quality and efficiency of trained word embeddings to TFIDF representations in modeling the content of scientific articles.
Our results show that content models based on word embeddings are better for titles (short text), while TFIDF works better for abstracts (longer text).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the last few years, word embeddings derived from neural networks have become popular in the natural language processing literature. Most studies have focused on the quality and application of word embeddings trained on publicly available corpora such as Wikipedia or other news and social media sources. However, these studies are limited to generic text and thus lack technical and scientific nuances such as domain-specific vocabulary, abbreviations, or scientific formulas, which are common in academic contexts. This research focuses on the performance of word embeddings applied to a large-scale academic corpus. More specifically, we compare the quality and efficiency of trained word embeddings to TFIDF representations in modeling the content of scientific articles. We use a word2vec skip-gram model trained on the titles and abstracts of about 70 million scientific articles. Furthermore, we have developed a benchmark to evaluate content models in a scientific context. The benchmark is based on a categorization task that matches articles to journals for about 1.3 million articles published in 2017. Our results show that content models based on word embeddings are better for titles (short text), while TFIDF works better for abstracts (longer text). However, the slight improvement of TFIDF on longer text comes at the expense of 3.7 times the memory and up to 184 times the computation time, which may make it inefficient for online applications. In addition, we have created a two-dimensional visualization of the journals modeled via embeddings to qualitatively inspect the embedding model. This graph reveals useful insights and can be used to find competitive journals or gaps where new journals could be proposed.
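
As a rough illustration of the pipeline the abstract describes, the sketch below averages skip-gram word vectors into document vectors, builds a TFIDF baseline, and matches articles to journals by nearest centroid. This is not the authors' code: the toy corpus, vector sizes, and the nearest-centroid proxy for the journal benchmark are all assumptions.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# averaged word2vec document vectors vs TFIDF for journal matching.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestCentroid
from sklearn.decomposition import PCA

# Toy (title/abstract text, journal) pairs standing in for ~70M articles.
docs = [
    ("attention mechanisms for neural sequence learning", "J-ML"),
    ("graph neural networks for molecular property prediction", "J-ML"),
    ("crispr based gene editing in plant genomes", "J-Bio"),
    ("protein folding dynamics observed with cryo em", "J-Bio"),
]
texts = [t for t, _ in docs]
labels = [j for _, j in docs]
tokens = [t.split() for t in texts]

# Skip-gram word2vec, as in the paper (sg=1); sizes are illustrative.
w2v = Word2Vec(tokens, vector_size=50, sg=1, min_count=1, epochs=100)

def doc_vector(words):
    # Document vector as the mean of its word vectors, one common choice.
    vecs = [w2v.wv[w] for w in words if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)

emb_X = np.stack([doc_vector(t) for t in tokens])            # dense, fixed dim
tfidf_X = TfidfVectorizer().fit_transform(texts).toarray()   # vocabulary-sized

# The paper's benchmark matches articles to journals; nearest-centroid
# journal "content models" are a simple proxy for that task.
for name, X in [("word2vec-avg", emb_X), ("tfidf", tfidf_X)]:
    clf = NearestCentroid().fit(X, labels)
    acc = (clf.predict(X) == np.array(labels)).mean()
    print(f"{name:12s} dim={X.shape[1]:4d} train-accuracy={acc:.2f}")

# A 2-D projection in the spirit of the paper's journal map; the paper's
# own projection method is not specified here, so PCA is an assumption.
print(PCA(n_components=2).fit_transform(emb_X))
```

The memory and runtime gap reported in the abstract follows from dimensionality: the TFIDF vector grows with the vocabulary, while the averaged embedding stays at a fixed width.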
Related papers
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
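
For readers who want to touch those semantic features directly, the sketch below queries the public Semantic Scholar Graph API for papers and their vector embeddings. The endpoint and field names follow the public API documentation but should be treated as assumptions that may change.

```python
# Hedged sketch: fetch papers plus embeddings from the S2 Graph API.
import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": "document embedding scientific articles",
            "fields": "title,abstract,embedding", "limit": 3},
    timeout=30,
)
resp.raise_for_status()
for paper in resp.json().get("data", []):
    emb = paper.get("embedding") or {}
    print(paper["title"], "| embedding dims:", len(emb.get("vector", [])))
```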
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- MIST: a Large-Scale Annotated Resource and Neural Models for Functions of Modal Verbs in English Scientific Text [1.8502316793903635]
We introduce the MIST dataset, which contains 3737 modal instances in five scientific domains annotated for their semantic, pragmatic, or rhetorical function.
We systematically evaluate a set of competitive neural architectures on MIST.
Our corpus analysis provides evidence that scientific communities differ in their usage of modal verbs.
arXiv Detail & Related papers (2022-12-14T11:10:03Z)
- TERMinator: A system for scientific texts processing [0.0]
This paper is devoted to the extraction of entities and semantic relations between them from scientific texts.
We present a dataset that includes annotations for two tasks and develop a system called TERMinator to study the influence of language models on term recognition.
arXiv Detail & Related papers (2022-09-29T15:14:42Z)
- Automatic Analysis of Linguistic Features in Journal Articles of Different Academic Impacts with Feature Engineering Techniques [0.975434908987426]
This study extracts micro-level linguistic features from research articles (RAs) in high- and moderate-impact journals, using feature engineering methods.
We extracted 25 highly relevant features from the Corpus of English Journal Articles through feature selection methods (see the sketch below).
Results showed that 24 linguistic features, such as the overlap of content words between adjacent sentences and the use of third-person pronouns, auxiliary verbs, tense, and emotional words, provide consistent and accurate predictions for journal articles with different academic impacts.
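
A generic sketch of that recipe: start from many candidate linguistic features, keep the k most predictive via feature selection, then classify articles by journal impact. The feature matrix below is synthetic; the study's actual 25 features and selection method are not reproduced here.

```python
# Illustrative feature-selection pipeline (assumed, not the study's code).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))     # 60 candidate linguistic features
y = rng.integers(0, 2, size=200)   # 0 = moderate-impact, 1 = high-impact

model = make_pipeline(
    SelectKBest(score_func=f_classif, k=25),  # keep 25, as in the study
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
print("selected feature indices:",
      model.named_steps["selectkbest"].get_support(indices=True))
```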
arXiv Detail & Related papers (2021-11-15T03:56:50Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- CiteWorth: Cite-Worthiness Detection for Improved Scientific Document Understanding [23.930041685595775]
We present an in-depth study of cite-worthiness detection in English, where a sentence is labelled for whether or not it cites an external source.
CiteWorth is high-quality, challenging, and suitable for studying problems such as domain adaptation.
arXiv Detail & Related papers (2021-05-23T11:08:45Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific paper summarization by utilizing the citation graph.
We construct a novel scientific paper summarization dataset, the Semantic Scholar Network (SSN), which contains 141K research papers across domains.
Our model achieves competitive performance compared with pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated promising results on this canonical text classification task.
Despite this success, their performance can be largely jeopardized in practice because they are unable to capture high-order interactions between words.
We propose a principled model, hypergraph attention networks (HyperGAT), which obtains more expressive power with less computational cost for text representation learning (see the sketch below).
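
A minimal, single-head sketch in the spirit of HyperGAT: words are nodes, each document is a hyperedge over its words, and one layer attends node-to-edge and then edge-to-node. This simplification is an assumption for illustration, not the paper's exact architecture.

```python
# Simplified hypergraph-attention layer (assumed design, single head).
import torch
import torch.nn as nn

class HyperGATLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W1 = nn.Linear(in_dim, out_dim, bias=False)   # node transform
        self.W2 = nn.Linear(out_dim, out_dim, bias=False)  # edge transform
        self.a1 = nn.Linear(out_dim, 1, bias=False)        # node->edge scores
        self.a2 = nn.Linear(2 * out_dim, 1, bias=False)    # edge->node scores
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, H):
        # x: (N, in_dim) word-node features; H: (N, E) binary incidence
        # matrix with H[i, j] = 1 iff word i occurs in document j.
        h = self.W1(x)                                          # (N, D)
        # Node -> hyperedge: each document attends over its words.
        s = self.act(self.a1(h)).expand(-1, H.shape[1])         # (N, E)
        alpha = torch.softmax(s.masked_fill(H == 0, float("-inf")), dim=0)
        f = alpha.T @ h                                         # (E, D)
        # Hyperedge -> node: each word attends over its documents.
        pair = torch.cat([h.unsqueeze(1).expand(-1, H.shape[1], -1),
                          f.unsqueeze(0).expand(H.shape[0], -1, -1)], dim=-1)
        t = self.act(self.a2(pair)).squeeze(-1)                 # (N, E)
        beta = torch.softmax(t.masked_fill(H == 0, float("-inf")), dim=1)
        return torch.relu(beta @ self.W2(f))                    # (N, D)

# Toy usage: 5 word nodes shared across 2 document hyperedges.
x = torch.randn(5, 16)
H = torch.tensor([[1., 0.], [1., 0.], [1., 1.], [0., 1.], [0., 1.]])
print(HyperGATLayer(16, 32)(x, H).shape)  # torch.Size([5, 32])
```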
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- Large Scale Subject Category Classification of Scholarly Papers with Deep Attentive Neural Networks [15.241086410108512]
We propose a deep attentive neural network (DANN) that classifies scholarly papers using only their abstracts.
The proposed network consists of two bi-directional recurrent neural networks followed by an attention layer.
Our best model achieves a micro-F1 of 0.76, with F1 for individual subject categories ranging from 0.50 to 0.95 (a minimal sketch of the architecture follows this entry).
arXiv Detail & Related papers (2020-07-27T19:42:42Z)
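
A minimal sketch of the described architecture: two stacked bidirectional recurrent layers followed by an attention layer that pools token states into one abstract vector for classification. The vocabulary size, dimensions, number of categories, and choice of LSTM cells are placeholder assumptions.

```python
# Sketch of a deep attentive classifier over abstracts (assumed sizes).
import torch
import torch.nn as nn

class DANNSketch(nn.Module):
    """Two stacked bidirectional LSTMs with attention pooling."""
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128,
                 n_classes=100):  # all sizes are placeholder assumptions
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.birnn = nn.LSTM(emb_dim, hidden, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each token state
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded abstract tokens
        states, _ = self.birnn(self.embed(token_ids))      # (B, T, 2H)
        weights = torch.softmax(self.attn(states), dim=1)  # (B, T, 1)
        pooled = (weights * states).sum(dim=1)             # attention pooling
        return self.out(pooled)                            # category logits

model = DANNSketch()
logits = model(torch.randint(1, 30000, (4, 200)))  # 4 abstracts, 200 tokens
print(logits.shape)  # torch.Size([4, 100])
```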