Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus
- URL: http://arxiv.org/abs/2503.23229v1
- Date: Sat, 29 Mar 2025 21:19:43 GMT
- Title: Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus
- Authors: Claas Beger, Carl-Leander Henneking,
- Abstract summary: We present Citegeist: An application pipeline using dynamic Retrieval Augmented Generation (RAG) on the arXiv Corpus.<n>For this purpose, we employ a mixture of embedding-based similarity matching, summarization, and multi-stage filtering.<n>To adapt to the continuous growth of the document base, we also present an optimized way of incorporating new and modified papers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models provide significant new opportunities for the generation of high-quality written works. However, their employment in the research community is inhibited by their tendency to hallucinate invalid sources and lack of direct access to a knowledge base of relevant scientific articles. In this work, we present Citegeist: An application pipeline using dynamic Retrieval Augmented Generation (RAG) on the arXiv Corpus to generate a related work section and other citation-backed outputs. For this purpose, we employ a mixture of embedding-based similarity matching, summarization, and multi-stage filtering. To adapt to the continuous growth of the document base, we also present an optimized way of incorporating new and modified papers. To enable easy utilization in the scientific community, we release both, a website (https://citegeist.org), as well as an implementation harness that works with several different LLM implementations.
Related papers
- On the Merits of LLM-Based Corpus Enrichment [11.398498369228571]
We argue for a novel perspective: using genAI to enrich a document corpus.<n>The enrichment is based on modifying existing documents or generating new ones.
arXiv Detail & Related papers (2025-06-06T12:02:14Z) - MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs [54.5729817345543]
MOLE is a framework that automatically extracts metadata attributes from scientific papers covering datasets of languages other than Arabic.<n>Our methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output.
arXiv Detail & Related papers (2025-05-26T10:31:26Z) - Chatting with Papers: A Hybrid Approach Using LLMs and Knowledge Graphs [3.68389405018277]
This demo paper reports on a new workflow textitGhostWriter that combines the use of Large Language Models and Knowledge Graphs to support navigation through collections.<n>Based on the tool-suite textitEverythingData at the backend, textitGhostWriter provides an interface that enables querying and chatting'' with a collection.
arXiv Detail & Related papers (2025-05-16T18:51:51Z) - Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning [57.09163579304332]
We introduce PaperCoder, a framework that transforms machine learning papers into functional code repositories.
PaperCoder operates in three stages: planning, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files.
We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations.
arXiv Detail & Related papers (2025-04-24T01:57:01Z) - Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol [83.90769864167301]
Literature review tables are essential for summarizing and comparing collections of scientific papers.
We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers.
Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques.
arXiv Detail & Related papers (2025-04-14T14:52:28Z) - HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction [14.731720495144112]
We introduce the novel concept of core citation, which identifies the critical references that go beyond superficial mentions.
We propose $textbfHLM-Cite, a $textbfH$ybrid $textbfL$anguage $textbfM$odel workflow for citation prediction.
We evaluate HLM-Cite across 19 scientific fields, demonstrating a 17.6% performance improvement comparing SOTA methods.
arXiv Detail & Related papers (2024-10-10T10:46:06Z) - SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z) - Explaining Relationships Among Research Papers [14.223038413516685]
We propose a feature-based, LLM-prompting approach to generate richer citation texts.
We find a strong correlation between human preference and integrative writing style, suggesting that humans prefer high-level, abstract citations.
arXiv Detail & Related papers (2024-02-20T23:38:39Z) - LitLLM: A Toolkit for Scientific Literature Review [15.785989492351684]
We propose a toolkit that operates on Retrieval Augmented Generation (RAG) principles.<n>Our system first initiates a web search to retrieve relevant papers.<n>Second, the system re-ranks the retrieved papers based on the user-provided abstract.<n>Third, the related work section is generated based on the re-ranked results and the abstract.
arXiv Detail & Related papers (2024-02-02T02:41:28Z) - CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z) - Stretching Sentence-pair NLI Models to Reason over Long Documents and
Clusters [35.103851212995046]
Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs.
We explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on.
We develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset.
arXiv Detail & Related papers (2022-04-15T12:56:39Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z) - Generating Related Work [37.161925758727456]
We model generating related work sections while being cognisant of the motivation behind citing papers.
Our model outperforms several strong state-of-the-art summarization and multi-document summarization models.
arXiv Detail & Related papers (2021-04-18T00:19:37Z) - Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z) - What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.