ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
- URL: http://arxiv.org/abs/2504.00824v2
- Date: Thu, 03 Apr 2025 15:07:29 GMT
- Title: ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations
- Authors: Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, Wenhu Chen
- Abstract summary: We introduce ScholarCopilot, a unified framework designed to enhance existing large language models for academic writing. ScholarCopilot determines when to retrieve scholarly references by generating a retrieval token [RET], which is then used to query a citation database. We jointly optimize both the generation and citation tasks within a single framework to improve efficiency.
- Score: 45.57178343138677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Academic writing requires both coherent text generation and precise citation of relevant literature. Although recent Retrieval-Augmented Generation (RAG) systems have significantly improved factual accuracy in general-purpose text generation, their ability to support professional academic writing remains limited. In this work, we introduce ScholarCopilot, a unified framework designed to enhance existing large language models for generating professional academic articles with accurate and contextually relevant citations. ScholarCopilot dynamically determines when to retrieve scholarly references by generating a retrieval token [RET], which is then used to query a citation database. The retrieved references are fed into the model to augment the generation process. We jointly optimize both the generation and citation tasks within a single framework to improve efficiency. Our model is built upon Qwen-2.5-7B and trained on 500K papers from arXiv. It achieves a top-1 retrieval accuracy of 40.1% on our evaluation dataset, outperforming baselines such as E5-Mistral-7B-Instruct (15.0%) and BM25 (9.8%). On a dataset of 1,000 academic writing samples, ScholarCopilot scores 16.2/25 in generation quality -- measured across relevance, coherence, academic rigor, completeness, and innovation -- significantly surpassing all existing models, including much larger ones like the Retrieval-Augmented Qwen2.5-72B-Instruct. Human studies further demonstrate that ScholarCopilot, despite being a 7B model, significantly outperforms ChatGPT, achieving 100% preference in citation quality and over 70% in overall usefulness.
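The retrieval loop described in the abstract can be illustrated with a minimal sketch. Everything beyond the `[RET]` token itself is an assumption made for illustration: the three-entry `CITATION_DB`, the bag-of-words `embed` function, and the scripted token stream are hypothetical stand-ins, not the authors' model or data. In particular, the text written so far is used here as the query, and the step of feeding the retrieved reference back into the model is reduced to inserting a citation marker. The last function shows how a top-1 retrieval accuracy of the kind reported above might be computed.

```python
import math
from collections import Counter

RET = "[RET]"  # retrieval token named in the abstract

# Hypothetical citation database: key -> abstract text (illustrative entries only).
CITATION_DB = {
    "paper_a": "dense passage retrieval for open domain question answering",
    "paper_b": "image classification with deep convolutional neural networks",
    "paper_c": "bm25 probabilistic relevance framework for lexical retrieval",
}


def embed(text):
    """Toy bag-of-words vector; a stand-in for a learned representation."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top1(query):
    """Return the key of the database entry most similar to the query."""
    q = embed(query)
    return max(CITATION_DB, key=lambda k: cosine(q, embed(CITATION_DB[k])))


def write_with_citations(token_stream):
    """Copy generated tokens; on [RET], query the database with the text so far."""
    out = []
    for tok in token_stream:
        if tok == RET:
            out.append(f"[cite:{retrieve_top1(' '.join(out))}]")
        else:
            out.append(tok)
    return " ".join(out)


def top1_accuracy(queries, gold_keys):
    """Fraction of queries whose top-ranked entry matches the gold citation."""
    hits = sum(retrieve_top1(q) == g for q, g in zip(queries, gold_keys))
    return hits / len(queries)


# Scripted output standing in for a language model that emits [RET] mid-sentence.
scripted = "Lexical retrieval with bm25".split() + [RET] + "remains a strong baseline".split()
text = write_with_citations(scripted)
print(text)
```

In this toy setup the generator decides where a citation belongs by emitting `[RET]`, and the retriever decides which paper fills that slot, which mirrors the division of labor the abstract describes.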
Related papers
- Comprehensive Manuscript Assessment with Text Summarization Using 69707 articles [10.943765373420135]
We harness Scopus to curate a significantly comprehensive and large-scale dataset of information from 69707 scientific articles. We propose a deep learning methodology for the impact-based classification tasks, which leverages semantic features extracted from the manuscripts and paper metadata.
arXiv Detail & Related papers (2025-03-26T07:56:15Z)
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
- DocReLM: Mastering Document Retrieval with Language Model [49.847369507694154]
We demonstrate that by utilizing large language models, a document retrieval system can achieve advanced semantic understanding capabilities.
Our approach involves training the retriever and reranker using domain-specific data generated by large language models.
We use a test set annotated by academic researchers in the fields of quantum physics and computer vision to evaluate our system's performance.
arXiv Detail & Related papers (2024-05-19T06:30:22Z)
- KG-CTG: Citation Generation through Knowledge Graph-guided Large Language Models [35.80247519023821]
Citation Text Generation (CTG) is a task in natural language processing (NLP) that aims to produce text that accurately cites or references a cited document within a source document.
This paper presents a framework, and a comparative study to demonstrate the use of Large Language Models (LLMs) for the task of citation generation.
arXiv Detail & Related papers (2024-04-15T13:06:32Z)
- Language Models for Code Completion: A Practical Evaluation [13.174471984950857]
This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code.
We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions.
We found that 66.3% of failures were due to the models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote.
arXiv Detail & Related papers (2024-02-25T20:43:55Z)
- MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents [77.34726150561087]
We propose MIReAD, a simple method that learns high-quality representations of scientific papers.
We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
arXiv Detail & Related papers (2023-05-07T03:29:55Z)
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- How Large Language Models are Transforming Machine-Paraphrased Plagiarism [3.8768839735240737]
This work explores T5 and GPT-3 for machine-paraphrase generation on scientific articles from arXiv, student theses, and Wikipedia.
We evaluate the detection performance of six automated solutions and one commercial plagiarism detection software.
Human experts rate the quality of paraphrases generated by GPT-3 as high as original texts.
arXiv Detail & Related papers (2022-10-07T14:08:57Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction [4.4641025448898475]
We propose the use of HANs combined with structure-tags which mark the role of sentences in the document.
Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction.
arXiv Detail & Related papers (2020-04-30T22:34:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.