The Effect of Metadata on Scientific Literature Tagging: A Cross-Field
Cross-Model Study
- URL: http://arxiv.org/abs/2302.03341v1
- Date: Tue, 7 Feb 2023 09:34:41 GMT
- Authors: Yu Zhang, Bowen Jin, Qi Zhu, Yu Meng, Jiawei Han
- Abstract summary: We systematically study the effect of metadata on scientific literature tagging across 19 fields.
We observe some ubiquitous patterns of metadata's effects across all fields.
- Score: 29.965010251365946
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the exponential growth of scientific publications on the Web, there is
a pressing need to tag each paper with fine-grained topics so that researchers
can track their fields of interest rather than drowning in the whole
literature. Scientific literature tagging goes beyond a pure multi-label text
classification task because papers on the Web are prevalently accompanied by
metadata information such as venues, authors, and references, which may serve
as additional signals to infer relevant tags. Although there have been studies
making use of metadata in academic paper classification, their focus is often
restricted to one or two scientific fields (e.g., computer science and
biomedicine) and to one specific model. In this work, we systematically study
the effect of metadata on scientific literature tagging across 19 fields. We
select three representative multi-label classifiers (i.e., a bag-of-words
model, a sequence-based model, and a pre-trained language model) and explore
their performance change in scientific literature tagging when metadata are fed
to the classifiers as additional features. We observe some ubiquitous patterns
of metadata's effects across all fields (e.g., venues are consistently
beneficial to paper tagging in almost all cases), as well as some unique
patterns in fields other than computer science and biomedicine, which are not
explored in previous studies.
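As a rough illustration of what "metadata fed to the classifiers as additional features" can look like for a bag-of-words model, the sketch below appends prefixed metadata tokens to the text tokens. It is not the authors' implementation: the function name `build_features`, the `VENUE_`/`AUTHOR_`/`REF_` prefixes, and the sample paper are all hypothetical choices made for this example.

```python
from collections import Counter

def build_features(text, venue=None, authors=(), references=()):
    """Return a bag-of-words Counter; metadata become prefixed tokens.

    Hypothetical helper for illustration only. Text tokens are lowercased,
    so the uppercase prefixes cannot collide with ordinary words.
    """
    tokens = text.lower().split()
    if venue:
        tokens.append(f"VENUE_{venue}")
    tokens.extend(f"AUTHOR_{a}" for a in authors)
    tokens.extend(f"REF_{r}" for r in references)
    return Counter(tokens)

# Made-up paper record (not taken from any dataset in the study).
paper = {
    "text": "graph neural networks for text classification",
    "venue": "ExampleConf",
    "authors": ["author_a", "author_b"],
    "references": ["paper_17", "paper_42", "paper_99"],
}

# Text-only features versus text plus metadata features.
text_only = build_features(paper["text"])
with_meta = build_features(
    paper["text"],
    venue=paper["venue"],
    authors=paper["authors"],
    references=paper["references"],
)
```

Comparing a classifier trained on `text_only`-style features with one trained on `with_meta`-style features is one simple way to measure the effect of each metadata type; dropping a single argument (for example `authors`) isolates that type's contribution.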
Related papers
- An Instance-based Plus Ensemble Learning Method for Classification of Scientific Papers [2.0794749869068005]
This paper introduces a novel approach that combines instance-based learning and ensemble learning techniques for classifying scientific papers.
Experiments show that the proposed classification method is effective and efficient in categorizing papers into various research areas.
arXiv Detail & Related papers (2024-09-21T19:42:15Z)
- Ontology Embedding: A Survey of Methods, Applications and Resources [54.3453925775069]
Ontologies are widely used for representing domain knowledge and metadata.
One straightforward solution is to integrate statistical analysis and machine learning.
Numerous papers have been published on embedding, but a lack of systematic reviews hinders researchers from gaining a comprehensive understanding of this field.
arXiv Detail & Related papers (2024-06-16T14:49:19Z)
- Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains [51.02035914828596]
We study the task of seed-guided fine-grained entity typing in science and engineering domains.
We propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus.
It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types.
arXiv Detail & Related papers (2024-01-23T22:36:03Z)
- Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation [0.0]
We present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts.
Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain.
arXiv Detail & Related papers (2023-11-08T22:09:31Z)
- Mapping Research Trajectories [0.0]
We propose a principled approach for mapping research trajectories, which is applicable to all kinds of scientific entities.
Our visualizations depict the research topics of entities over time in a straightforward, interpretable manner.
In a practical demonstrator application, we exemplify the proposed approach on a publication corpus from machine learning.
arXiv Detail & Related papers (2022-04-25T13:32:39Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Semantic Analysis for Automated Evaluation of the Potential Impact of Research Articles [62.997667081978825]
This paper presents a novel method for vector representation of text meaning based on information theory.
We show how this informational semantics is used for text classification on the basis of the Leicester Scientific Corpus.
We show that an informational approach to representing the meaning of a text has offered a way to effectively predict the scientific impact of research papers.
arXiv Detail & Related papers (2021-04-26T20:37:13Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high from those with a low likelihood of gaining citations.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)
- Minimally Supervised Categorization of Text with Metadata [40.13841133991089]
We propose MetaCat, a minimally supervised framework to categorize text with metadata.
We develop a generative process describing the relationships between words, documents, labels, and metadata.
Based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity.
arXiv Detail & Related papers (2020-05-01T21:42:32Z)