Analyzing Scientific Publications using Domain-Specific Word Embedding and Topic Modelling
- URL: http://arxiv.org/abs/2112.12940v1
- Date: Fri, 24 Dec 2021 04:25:34 GMT
- Title: Analyzing Scientific Publications using Domain-Specific Word Embedding and Topic Modelling
- Authors: Trisha Singhal, Junhua Liu, Lucienne T.M. Blessing, Kwan Hui Lim
- Abstract summary: This paper presents a framework for conducting scientific analyses of academic publications.
It combines various techniques of Natural Language Processing, such as word embedding and topic modelling.
We propose two novel scientific publication embeddings, PUB-G and PUB-W, which are capable of learning semantic meanings of general as well as domain-specific words.
- Score: 0.6308539010172307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scientific world is changing at a rapid pace, with new technology being
developed and new trends being set at an increasing frequency. This paper
presents a framework for conducting scientific analyses of academic
publications, which is crucial for monitoring research trends and identifying
potential innovations. This framework adopts and combines various techniques of
Natural Language Processing, such as word embedding and topic modelling. Word
embedding is used to capture semantic meanings of domain-specific words. We
propose two novel scientific publication embeddings, PUB-G and PUB-W,
which are capable of learning semantic meanings of general as well as
domain-specific words in various research fields. Thereafter, topic modelling
is used to identify clusters of research topics within these larger research
fields. We curated a publication dataset drawn from two conferences and two journals across two research domains, spanning 1995 to 2020. Experimental results show
that our PUB-G and PUB-W embeddings outperform other baseline embeddings by a margin of ~0.18-1.03 in topic coherence.
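As an illustration of the pipeline the abstract describes (domain-specific word embedding, topic modelling, evaluation by topic coherence), the following is a minimal sketch using off-the-shelf gensim components. PUB-G and PUB-W are not reproduced here; Word2Vec, LDA, and the c_v coherence measure are stand-in assumptions rather than the authors' exact setup.

```python
# Minimal sketch of the embedding + topic-modelling pipeline, using generic
# gensim components as stand-ins for PUB-G/PUB-W (which are not public here).
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec
from gensim.models.coherencemodel import CoherenceModel

# Toy corpus: each "publication" is a list of tokens (preprocessing assumed).
docs = [
    ["word", "embedding", "captures", "semantic", "meaning"],
    ["domain", "specific", "word", "embedding", "semantic"],
    ["topic", "modelling", "identifies", "research", "clusters"],
    ["topic", "coherence", "scores", "research", "clusters"],
]

# 1) Domain-specific word embedding (stand-in for PUB-G / PUB-W).
w2v = Word2Vec(docs, vector_size=50, window=3, min_count=1, epochs=20)

# 2) Topic modelling to find clusters of research topics.
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=2, passes=10)

# 3) Topic coherence, the comparison metric reported in the abstract
#    (which specific coherence measure the paper uses is assumed here).
score = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                       coherence="c_v").get_coherence()
print(f"c_v topic coherence: {score:.3f}")
```

In the paper's setting, step 1 would be replaced by PUB-G or PUB-W, and the resulting coherence scores compared against those obtained with baseline embeddings.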
Related papers
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference [65.37685198688538]
This paper presents MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains.
We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs).
We show that domain shift degrades the performance of scientific NLI models, which demonstrates the diverse characteristics of different domains in our dataset.
arXiv Detail & Related papers (2024-04-11T18:12:12Z)
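For the PLM fine-tuning baselines mentioned in the MSciNLI entry above, the sketch below shows the core sequence-pair encoding step with Hugging Face transformers; the model choice and the four-way label set are illustrative assumptions, not the paper's exact configuration.

```python
# Hedged sketch of an NLI baseline in the style the MSciNLI entry describes:
# a PLM classifies a (premise, hypothesis) pair. Model and labels are assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["entailment", "contrasting", "reasoning", "neutral"]  # assumed label set
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS))

premise = "We propose a domain-specific embedding for scientific text."
hypothesis = "The method targets general-purpose chat applications."

# The tokenizer packs both sentences into one input with separator tokens.
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
# The classification head is untrained here, so the prediction is arbitrary;
# fine-tuning on the dataset's sentence pairs is the missing step.
print(LABELS[logits.argmax(dim=-1).item()])
```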
- AHAM: Adapt, Help, Ask, Model -- Harvesting LLMs for literature mining [3.8384235322772864]
We present the AHAM methodology and a metric that guides the domain-specific adaptation of the BERTopic topic modeling framework.
By utilizing the LLaMa2 generative language model, we generate topic definitions via one-shot learning.
For inter-topic similarity evaluation, we leverage metrics from language generation and translation processes.
arXiv Detail & Related papers (2023-12-25T18:23:03Z)
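The AHAM entry above adapts BERTopic and names topics with a generative LLM via one-shot learning. A rough sketch of that shape follows, assuming a toy corpus, an invented prompt template, and a hypothetical llama2_generate() call (not a real API):

```python
# Sketch of BERTopic fitting plus one-shot topic naming, in the spirit of the
# AHAM entry above. The prompt template is invented and llama2_generate() is a
# hypothetical placeholder; AHAM's actual prompts and metrics are not shown here.
from bertopic import BERTopic

# A real run assumes thousands of abstracts; small varied toy docs keep this runnable.
abstracts = ([f"graph neural networks for citation analysis study {i}" for i in range(20)] +
             [f"word embeddings capture semantic meaning in text paper {i}" for i in range(20)])

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", min_topic_size=5)
topics, probs = topic_model.fit_transform(abstracts)

def one_shot_prompt(keywords):
    # One worked example, then the new topic's keywords: one-shot learning.
    example = ("Keywords: graphene, conductivity, nanoribbon\n"
               "Topic label: Graphene electronics\n")
    return example + f"Keywords: {', '.join(keywords)}\nTopic label:"

for topic_id in topic_model.get_topic_info().Topic:
    if topic_id == -1:  # -1 is BERTopic's outlier bucket
        continue
    keywords = [word for word, _ in topic_model.get_topic(topic_id)]
    prompt = one_shot_prompt(keywords[:10])
    # label = llama2_generate(prompt)  # hypothetical LLM call, swap in any client
```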
- An Inclusive Notion of Text [69.36678873492373]
We argue that clarity on the notion of text is crucial for reproducible and generalizable NLP.
We introduce a two-tier taxonomy of linguistic and non-linguistic elements that are available in textual sources and can be used in NLP modeling.
arXiv Detail & Related papers (2022-11-10T14:26:43Z)
- Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review [52.359007622096684]
Peer review is a key component of the publishing process in most fields of science.
Existing NLP studies focus on the analysis of individual texts.
Editorial assistance, however, often requires modeling interactions between pairs of texts.
arXiv Detail & Related papers (2022-04-22T16:39:38Z)
- SciNoBo: A Hierarchical Multi-Label Classifier of Scientific Publications [0.7305019142196583]
Classifying scientific publications according to Field-of-Science (FoS) is of crucial importance.
We present SciNoBo, a novel classification system of publications to predefined FoS.
In contrast to other works, our system supports assignments of publications to multiple fields by considering their multi-arity potential.
arXiv Detail & Related papers (2022-04-02T15:09:33Z)
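SciNoBo's own hierarchical classifier is not reproduced here; purely to illustrate the multi-label idea in the entry above (one publication, several Field-of-Science labels), here is a generic scikit-learn one-vs-rest sketch on toy data:

```python
# Generic multi-label Field-of-Science illustration (NOT SciNoBo itself, which
# is hierarchical): one-vs-rest logistic regression over TF-IDF title features,
# allowing a publication to be assigned to multiple fields at once.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

titles = [
    "Deep learning for protein structure prediction",
    "Graph algorithms for road network routing",
    "Gene expression analysis in model organisms",
    "Neural networks for genome sequence classification",
]
fields = [{"cs", "biology"}, {"cs"}, {"biology"}, {"cs", "biology"}]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(fields)                      # one binary column per field
X = TfidfVectorizer().fit_transform(titles)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
pred = clf.predict(X)
print([set(mlb.classes_[row.astype(bool)]) for row in pred])
```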
- Change Summarization of Diachronic Scholarly Paper Collections by Semantic Evolution Analysis [10.554831859741851]
We demonstrate a novel approach to analyze the collections of research papers published over longer time periods.
Our approach is based on comparing word semantic representations over time and aims to support users in better understanding large domain-focused archives of scholarly publications.
arXiv Detail & Related papers (2021-12-07T11:15:19Z)
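A minimal way to sketch the diachronic comparison in the entry above: train one Word2Vec model per time slice and measure how much a word's nearest-neighbour set changes between slices, a second-order comparison that sidesteps aligning the two vector spaces. The corpora and the overlap heuristic are toy assumptions, not the paper's method.

```python
# Toy sketch of comparing word semantics across time slices: one Word2Vec per
# period, then nearest-neighbour overlap per word (no space alignment needed).
from gensim.models import Word2Vec

docs_1995_2005 = [["network", "social", "graph", "structure", "community"]] * 50
docs_2010_2020 = [["network", "neural", "deep", "layers", "training"]] * 50

early = Word2Vec(docs_1995_2005, vector_size=50, min_count=1, epochs=10)
late = Word2Vec(docs_2010_2020, vector_size=50, min_count=1, epochs=10)

def neighbours(model, word, k=5):
    return {w for w, _ in model.wv.most_similar(word, topn=k)}

# Low Jaccard overlap between the neighbour sets suggests semantic change.
n_early, n_late = neighbours(early, "network"), neighbours(late, "network")
overlap = len(n_early & n_late) / len(n_early | n_late)
print(f"'network' neighbour overlap across periods: {overlap:.2f}")
```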
- Domain-adaptation of spherical embeddings [0.0]
We develop methods to counter the global rotation of the embedding space and propose strategies to update words and documents during domain-specific training.
We show that our strategies are able to reduce the performance cost of domain adaptation to a level similar to Word2Vec.
arXiv Detail & Related papers (2021-11-01T03:29:36Z)
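The spherical-embeddings entry above is concerned with a global rotation of the embedding space introduced by domain-specific training. The textbook tool for undoing such a rotation is orthogonal Procrustes alignment, sketched below in numpy; this shows the general technique, not necessarily the authors' exact update strategy.

```python
# Orthogonal Procrustes alignment: recover the global rotation between two
# embedding spaces. A sketch of the general technique, not the paper's exact fix.
import numpy as np

rng = np.random.default_rng(0)
source = rng.standard_normal((1000, 50))   # embeddings before adaptation

# Simulate a global rotation of the space during domain-specific training.
Q, _ = np.linalg.qr(rng.standard_normal((50, 50)))
target = source @ Q

# Solve min_R ||source @ R - target||_F over orthogonal R, via one SVD.
U, _, Vt = np.linalg.svd(source.T @ target)
R = U @ Vt

print(np.allclose(source @ R, target, atol=1e-6))  # rotation recovered: True
```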
- Domain Generalization: A Survey [146.68420112164577]
Domain generalization (DG) aims to achieve out-of-distribution (OOD) generalization by using only source domain data for model learning.
For the first time, a comprehensive literature review is provided to summarize the ten-year development in DG.
arXiv Detail & Related papers (2021-03-03T16:12:22Z)
- What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
- Semantic and Relational Spaces in Science of Science: Deep Learning Models for Article Vectorisation [4.178929174617172]
We focus on document-level embeddings based on the semantic and relational aspects of articles, using Natural Language Processing (NLP) and Graph Neural Networks (GNNs).
Our results show that using NLP we can encode a semantic space of articles, while with GNNs we can build a relational space where the social practices of a research community are also encoded.
arXiv Detail & Related papers (2020-11-05T14:57:41Z)
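For the semantic-plus-relational idea in the last entry, a minimal PyTorch Geometric sketch: text-derived features per article (the semantic side) are propagated over citation edges by a two-layer GCN (the relational side). The graph, features, and dimensions are toy assumptions, not the paper's architecture.

```python
# Toy sketch: article text features propagated over a citation graph by a GCN,
# combining a semantic space (NLP features) with a relational space (edges).
import torch
from torch_geometric.nn import GCNConv

num_articles, text_dim, embed_dim = 4, 32, 16

x = torch.randn(num_articles, text_dim)       # assumed text-derived features
edge_index = torch.tensor([[0, 1, 2, 3],      # article i cites article j
                           [1, 2, 3, 0]], dtype=torch.long)

class ArticleGCN(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(text_dim, 64)
        self.conv2 = GCNConv(64, embed_dim)

    def forward(self, x, edge_index):
        h = self.conv1(x, edge_index).relu()
        return self.conv2(h, edge_index)      # relational article embeddings

embeddings = ArticleGCN()(x, edge_index)
print(embeddings.shape)                       # torch.Size([4, 16])
```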