MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents
- URL: http://arxiv.org/abs/2305.04177v1
- Date: Sun, 7 May 2023 03:29:55 GMT
- Title: MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents
- Authors: Anastasia Razdaibiedina, Alexander Brechalov
- Abstract summary: We propose MIReAD, a simple method that learns high-quality representations of scientific papers.
We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
- Score: 77.34726150561087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning semantically meaningful representations from scientific documents
can facilitate academic literature search and improve the performance of
recommendation systems. Pre-trained language models have been shown to learn
rich textual representations, yet they cannot provide powerful document-level
representations for scientific articles. We propose MIReAD, a simple method
that learns high-quality representations of scientific papers by fine-tuning
a transformer model to predict the target journal class based on the abstract. We
train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000
journal classes. We show that MIReAD produces representations that can be used
for similar-paper retrieval, topic categorization, and literature search. Our
proposed approach outperforms six existing models for representation learning
on scientific documents across four evaluation standards.
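A minimal sketch of the fine-tuning objective described above, assuming a BERT-style encoder from the Hugging Face transformers library; the checkpoint name, the exact number of journal classes, and the [CLS]-pooling choice are illustrative assumptions rather than the authors' released configuration.

```python
# Sketch of the journal-class fine-tuning objective described in the abstract.
# The checkpoint, label count, and pooling choice are assumptions for
# illustration, not the authors' released setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_JOURNAL_CLASSES = 2000  # "over 2,000 journal classes" per the abstract

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_JOURNAL_CLASSES
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(abstract: str, journal_label: int) -> float:
    """One fine-tuning step: predict the target journal from the abstract."""
    batch = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
    out = model(**batch, labels=torch.tensor([journal_label]))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

def embed(abstract: str) -> torch.Tensor:
    """After fine-tuning, use the [CLS] hidden state as the paper representation."""
    batch = tokenizer(abstract, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model.base_model(**batch).last_hidden_state
    return hidden[:, 0, :]  # document-level embedding for retrieval or categorization
```

Once fine-tuned this way, similar-paper retrieval and topic categorization can operate directly on the returned embeddings, for example via cosine similarity or a lightweight classifier.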
Related papers
- DocReLM: Mastering Document Retrieval with Language Model [49.847369507694154]
We demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities.
Our approach involves training the retriever and reranker using domain-specific data generated by large language models.
We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance.
arXiv Detail & Related papers (2024-05-19T06:30:22Z)
- OpenMSD: Towards Multilingual Scientific Documents Similarity Measurement [11.602151258188862]
We develop and evaluate multilingual scientific documents similarity measurement models in this work.
We propose the first multilingual scientific documents dataset, Open-access Multilingual Scientific Documents (OpenMSD), which has 74M papers in 103 languages and 778M citation pairs.
arXiv Detail & Related papers (2023-09-19T11:38:39Z)
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations [52.01865318382197]
We introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations.
We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats.
A new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance.
arXiv Detail & Related papers (2022-11-23T21:25:39Z)
- SimCPSR: Simple Contrastive Learning for Paper Submission Recommendation System [0.0]
This study proposes a transformer-based model using transfer learning as an efficient approach to paper submission recommendation.
By combining essential information (such as the title, the abstract, and the list of keywords) with the aims and scope of journals, the model can recommend the top-K journals that maximize the paper's likelihood of acceptance.
arXiv Detail & Related papers (2022-05-12T08:08:22Z)
- Knowledge Graph informed Fake News Classification via Heterogeneous Representation Ensembles [1.8374319565577157]
We show how different document representations can be used for efficient fake news identification.
One of the key contributions is a set of novel document representation learning methods based solely on knowledge graphs.
We demonstrate that knowledge graph-based representations already achieve performance competitive with conventional representation learners.
arXiv Detail & Related papers (2021-10-20T09:41:14Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model achieves performance competitive with pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents by pretraining a Transformer language model on the citation graph.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
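For comparison with the citation-informed baseline above, a short sketch of producing document-level embeddings with the publicly released SPECTER checkpoint (allenai/specter on the Hugging Face hub) and scoring similar-paper retrieval by cosine similarity; the two example papers are invented, and the title-plus-abstract input format and [CLS] pooling follow the model's documented usage rather than anything specific to MIReAD.

```python
# Sketch: document embeddings from the released SPECTER checkpoint and a
# cosine-similarity score for similar-paper retrieval. The two papers below
# are invented examples.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

papers = [
    {"title": "Example paper A", "abstract": "Representation learning for biomedical abstracts."},
    {"title": "Example paper B", "abstract": "Retrieval of related scientific articles."},
]

# SPECTER's documented input: title and abstract joined by the separator token.
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    embeddings = model(**batch).last_hidden_state[:, 0, :]  # [CLS] pooling

# Higher cosine similarity -> more likely to be retrieved as a "similar paper".
score = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"similarity: {score.item():.3f}")
```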
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.