Citation Parsing and Analysis with Language Models
- URL: http://arxiv.org/abs/2505.15948v1
- Date: Wed, 21 May 2025 19:06:17 GMT
- Title: Citation Parsing and Analysis with Language Models
- Authors: Parth Sarin, Juan Pablo Alperin
- Abstract summary: We investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We find that, even out of the box, today's language models achieve high levels of accuracy on identifying the constituent components of each citation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even out of the box, today's language models achieve high levels of accuracy on identifying the constituent components of each citation, outperforming state-of-the-art methods. Moreover, the smallest model we evaluated, Qwen3-0.6B, can parse all fields with high accuracy in $2^5$ passes, suggesting that post-training is likely to be effective in producing small, robust citation parsing models. Such a tool could greatly improve the fidelity of citation networks and thus meaningfully improve research indexing and discovery, as well as further metascientific research.
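The abstract's observation that a small model like Qwen3-0.6B can parse all fields accurately in $2^5$ passes suggests a simple aggregation scheme: run the model several times per citation and keep the most frequent value for each field. The sketch below illustrates that majority-vote step only; the model call itself is abstracted away, and the field schema (`authors`, `title`, `venue`, `year`) is an assumption, since the paper's exact annotation format is not given here.

```python
import json
from collections import Counter

# Assumed field schema; the paper's actual annotation format may differ.
FIELDS = ("authors", "title", "venue", "year")

def majority_vote(parses):
    """Combine several independent parses of one plaintext citation.

    Each parse is a dict mapping field name -> extracted string. For every
    field we keep the value produced most often across passes, one simple way
    to realize the multi-pass idea described in the abstract.
    """
    merged = {}
    for field in FIELDS:
        votes = Counter(p[field] for p in parses if p.get(field))
        if votes:
            merged[field] = votes.most_common(1)[0][0]
    return merged

# Example: three noisy passes over the same citation string (hypothetical
# model outputs; a real pipeline would collect these from the LM).
passes = [
    {"authors": "P. Sarin; J. P. Alperin",
     "title": "Citation Parsing and Analysis with Language Models",
     "year": "2025"},
    {"authors": "P. Sarin; J. P. Alperin",
     "title": "Citation Parsing and Analysis",
     "year": "2025", "venue": "arXiv"},
    {"authors": "P. Sarin; J. P. Alperin",
     "title": "Citation Parsing and Analysis with Language Models",
     "year": "2025", "venue": "arXiv"},
]
print(json.dumps(majority_vote(passes), indent=2))
```

In this toy run, the truncated title from the second pass is outvoted by the two complete parses, and the `venue` field is recovered even though only two of the three passes emitted it.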
Related papers
- The OCR Quest for Generalization: Learning to recognize low-resource alphabets with model editing [2.7471068141502]
We aim to build models that generalize to new data distributions, such as new alphabets, faster than centralized fine-tuning strategies. In contrast to state-of-the-art meta-learning, we showcase the effectiveness of domain merging in sparse distributions of data. This research contributes a novel approach to building models that can easily adapt to under-represented alphabets.
arXiv Detail & Related papers (2025-06-07T11:05:33Z)
- Detecting Reference Errors in Scientific Literature with Large Language Models [0.552480439325792]
This work evaluated the ability of large language models in OpenAI's GPT family to detect quotation errors.
Results showed that large language models are able to detect erroneous citations with limited context and without fine-tuning.
arXiv Detail & Related papers (2024-11-09T07:30:38Z)
- Monolingual and Multilingual Misinformation Detection for Low-Resource Languages: A Comprehensive Survey [2.5459710368096586]
Misinformation transcends linguistic boundaries, posing a challenge for moderation systems. Most approaches to misinformation detection are monolingual, focused on high-resource languages. This survey provides a comprehensive overview of the current research on misinformation detection in low-resource languages.
arXiv Detail & Related papers (2024-10-24T03:02:03Z)
- CiteFusion: An Ensemble Framework for Citation Intent Classification Harnessing Dual-Model Binary Couples and SHAP Analyses [1.7812428873698407]
This study introduces CiteFusion, an ensemble framework designed to address the multiclass Citation Intent Classification task. CiteFusion achieves state-of-the-art performance, with Macro-F1 scores of 89.60% on SciCite and 76.24% on ACL-ARC. We release a web-based application that classifies citation intents leveraging CiteFusion models developed on SciCite.
arXiv Detail & Related papers (2024-07-18T09:29:33Z)
- Modeling citation worthiness by using attention-based bidirectional long short-term memory networks and interpretable models [0.0]
We propose a Bidirectional Long Short-Term Memory (BiLSTM) network with attention mechanism and contextual information to detect sentences that need citations.
We produce a new, large dataset (PMOA-CITE) based on PubMed Open Access Subset, which is orders of magnitude larger than previous datasets.
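The attention mechanism this summary mentions pools the BiLSTM's per-token hidden states into a single sentence vector before classification. The snippet below is a minimal pure-Python sketch of that pooling step only, under the assumption that the model uses softmax-normalized token scores; in the actual paper both the hidden states and the attention scores are learned.

```python
import math

def attention_pool(hidden_states, attn_scores):
    """Softmax-weighted pooling of per-token hidden states.

    hidden_states: list of equal-length float vectors, one per token
                   (stand-ins for BiLSTM outputs).
    attn_scores:   one unnormalized relevance score per token.
    Returns a single sentence vector: the attention-weighted sum of
    the token states, which a classifier would then consume.
    """
    exps = [math.exp(s) for s in attn_scores]
    z = sum(exps)
    alphas = [e / z for e in exps]  # softmax over token scores
    dim = len(hidden_states[0])
    return [sum(a * h[i] for a, h in zip(alphas, hidden_states))
            for i in range(dim)]

# Two tokens with 2-dim states; the second token gets most of the attention.
sent_vec = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 2.0])
```

With these toy inputs the second token's score of 2.0 dominates the softmax, so the pooled vector leans heavily toward that token's state.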
arXiv Detail & Related papers (2024-05-20T17:45:36Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations [34.99831757956635]
We formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations.
We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose the sentences into sub-claims for fine-grained verification.
arXiv Detail & Related papers (2024-03-04T07:06:41Z)
- Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning [2.88268082568407]
We develop a method to automatically assign survey papers to a taxonomy.
Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models.
arXiv Detail & Related papers (2024-02-16T02:21:59Z)
- Investigating Fairness Disparities in Peer Review: A Language Model Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)
- Deep Graph Learning for Anomalous Citation Detection [55.81334139806342]
We propose a novel deep graph learning model, namely GLAD (Graph Learning for Anomaly Detection), to identify anomalies in citation networks.
Within the GLAD framework, we propose an algorithm called CPU (Citation PUrpose) to discover the purpose of citation based on citation texts.
arXiv Detail & Related papers (2022-02-23T09:05:28Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from high-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model achieves competitive performance compared with pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.