ForeCite: Adapting Pre-Trained Language Models to Predict Future Citation Rates of Academic Papers
- URL: http://arxiv.org/abs/2505.08941v1
- Date: Tue, 13 May 2025 20:10:00 GMT
- Title: ForeCite: Adapting Pre-Trained Language Models to Predict Future Citation Rates of Academic Papers
- Authors: Gavin Hull, Alex Bihlo
- Abstract summary: We present ForeCite, a framework to predict the future citation rates of academic papers. ForeCite achieves a test correlation of $\rho = 0.826$ on a curated dataset of 900K+ biomedical papers published between 2000 and 2024. These results establish a new state-of-the-art in forecasting the long-term influence of academic research and lay the groundwork for the automated, high-fidelity evaluation of scientific contributions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting the future citation rates of academic papers is an important step toward the automation of research evaluation and the acceleration of scientific progress. We present $\textbf{ForeCite}$, a simple but powerful framework that augments pre-trained causal language models with a linear head for average monthly citation rate prediction. Adapting transformers for regression tasks, ForeCite achieves a test correlation of $\rho = 0.826$ on a curated dataset of 900K+ biomedical papers published between 2000 and 2024, a 27-point improvement over the previous state-of-the-art. Comprehensive scaling-law analysis reveals consistent gains across model sizes and data volumes, while temporal holdout experiments confirm practical robustness. Gradient-based saliency heatmaps suggest a potentially undue reliance on titles and abstract texts. These results establish a new state-of-the-art in forecasting the long-term influence of academic research and lay the groundwork for the automated, high-fidelity evaluation of scientific contributions.
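The core recipe described in the abstract, a linear regression head on top of frozen language-model features, evaluated by Pearson correlation, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature matrix `H` stands in for hidden states from a pre-trained causal LM (e.g. a last-token representation per paper), and the target construction is synthetic, so the numbers it produces are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for per-paper features from a frozen pre-trained causal LM.
# In the described setup these would be last-token hidden states; here
# they are random, purely to make the sketch self-contained.
n_papers, d = 1000, 64
H = rng.normal(size=(n_papers, d))

# Hypothetical target: average monthly citation rate, constructed to
# depend linearly on the features plus noise.
w_true = rng.normal(size=d)
y = H @ w_true + 0.5 * rng.normal(size=n_papers)

# The "linear head": ordinary least squares on the frozen features
# (with a bias column appended).
X = np.c_[H, np.ones(n_papers)]
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ w_hat

# Evaluate with the metric the paper reports: Pearson correlation rho.
rho = np.corrcoef(y, y_pred)[0, 1]
print(f"test correlation rho = {rho:.3f}")
```

In practice the head would be trained jointly with (or on top of) the language model on held-out splits; the point of the sketch is only the shape of the pipeline: frozen text features, a linear regression head, and correlation as the evaluation metric.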
Related papers
- In-depth Research Impact Summarization through Fine-Grained Temporal Citation Analysis [52.42612945266194]
We propose a new task: generating nuanced, expressive, and time-aware impact summaries. We show that these summaries capture both praise (confirmation citations) and critique (correction citations) through the evolution of fine-grained citation intents.
arXiv Detail & Related papers (2025-05-20T19:11:06Z) - ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations [45.57178343138677]
We introduce ScholarCopilot, a unified framework designed to enhance existing large language models for academic writing. ScholarCopilot determines when to retrieve scholarly references by generating a retrieval token [RET], which is then used to query a citation database. We jointly optimize both the generation and citation tasks within a single framework to improve efficiency.
arXiv Detail & Related papers (2025-04-01T14:12:14Z) - Comprehensive Manuscript Assessment with Text Summarization Using 69707 articles [10.943765373420135]
We harness Scopus to curate a significantly comprehensive and large-scale dataset of information from 69,707 scientific articles. We propose a deep learning methodology for the impact-based classification tasks, which leverages semantic features extracted from the manuscripts and paper metadata.
arXiv Detail & Related papers (2025-03-26T07:56:15Z) - Optimizing Research Portfolio For Semantic Impact [55.2480439325792]
Citation metrics are widely used to assess academic impact but suffer from social biases. We introduce rXiv Semantic Impact (XSI), a novel framework that predicts research impact. XSI tracks the evolution of research concepts in the academic knowledge graph.
arXiv Detail & Related papers (2025-02-19T17:44:13Z) - WithdrarXiv: A Large-Scale Dataset for Retraction Study [33.782357627001154]
We present WithdrarXiv, the first large-scale dataset of withdrawn papers from arXiv. We develop a comprehensive taxonomy of retraction reasons, identifying 10 distinct categories ranging from critical errors to policy violations. We demonstrate a simple yet highly accurate zero-shot automatic categorization of retraction reasons, achieving a weighted average F1-score of 0.96.
arXiv Detail & Related papers (2024-12-04T23:36:23Z) - Machine Learning to Promote Translational Research: Predicting Patent and Clinical Trial Inclusion in Dementia Research [0.0]
Projected to impact 1.6 million people in the UK by 2040 and costing £25 billion annually, dementia presents a growing challenge to society.
We used the Dimensions database to extract data from 43,091 UK dementia research publications between the years 1990-2023.
For patent predictions, the models achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.84 and 77.17% accuracy; for clinical trial predictions, an AUROC of 0.81 and 75.11% accuracy.
arXiv Detail & Related papers (2024-01-10T13:25:49Z) - CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z) - Deep forecasting of translational impact in medical research [1.8130872753848115]
We develop a suite of representational and discriminative mathematical models of multi-scale publication data.
We show that citations are only moderately predictive of translational impact as judged by inclusion in patents, guidelines, or policy documents.
We argue that content-based models of impact are superior in performance to conventional, citation-based measures.
arXiv Detail & Related papers (2021-10-17T19:29:41Z) - Back2Future: Leveraging Backfill Dynamics for Improving Real-time Predictions in Future [73.03458424369657]
In real-time forecasting in public health, data collection is a non-trivial and demanding task.
The 'backfill' phenomenon and its effect on model performance have barely been studied in the prior literature.
We formulate a novel problem and neural framework Back2Future that aims to refine a given model's predictions in real-time.
arXiv Detail & Related papers (2021-06-08T14:48:20Z) - Semantic Analysis for Automated Evaluation of the Potential Impact of Research Articles [62.997667081978825]
This paper presents a novel method for vector representation of text meaning based on information theory.
We show how this informational semantics is used for text classification on the basis of the Leicester Scientific Corpus.
We show that this informational approach to representing the meaning of a text offers a way to effectively predict the scientific impact of research papers.
arXiv Detail & Related papers (2021-04-26T20:37:13Z) - Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z) - A Survey on Text Classification: From Shallow to Deep Learning [83.47804123133719]
The last decade has seen a surge of research in this area due to the unprecedented success of deep learning.
This paper fills the gap by reviewing the state-of-the-art approaches from 1961 to 2021.
We create a taxonomy for text classification according to the text involved and the models used for feature extraction and classification.
arXiv Detail & Related papers (2020-08-02T00:09:03Z) - AGATHA: Automatic Graph-mining And Transformer based Hypothesis generation Approach [1.7954335118363964]
We present a hypothesis generation system that can introduce data-driven insights earlier in the discovery process.
AGATHA prioritizes plausible term-pairs among entity sets, allowing us to recommend new research directions.
This system achieves best-in-class performance on an established benchmark.
arXiv Detail & Related papers (2020-02-13T17:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.