Cleaning English Abstracts of Scientific Publications
- URL: http://arxiv.org/abs/2512.24459v1
- Date: Tue, 30 Dec 2025 20:45:50 GMT
- Title: Cleaning English Abstracts of Scientific Publications
- Authors: Michael E. Rose, Nils A. Herrmann, Sebastian Erhardt
- Abstract summary: We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts, and improves the information content of standard-length embeddings.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information, such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata, that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts, and improves the information content of standard-length embeddings.
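The paper's cleaner is a language model, not a set of rules. Still, a minimal rule-based sketch can illustrate the kind of clutter it targets; the patterns and the `clean_abstract` helper below are hypothetical examples, not the authors' method.

```python
import re

# Illustrative patterns for common abstract clutter. These are assumptions
# for demonstration only; the paper uses a trained language model instead.
CLUTTER_PATTERNS = [
    r"(?i)\u00a9\s*\d{4}[^.]*\.",          # publisher copyright lines, e.g. "(c) 2024 ..."
    r"(?i)all rights reserved\.?",          # trailing rights statements
    r"(?i)\b(?:background|methods?|results?|conclusions?|objectives?)\s*:\s*",  # section headings
    r"(?i)clinical trial registration[^.]*\.",  # registration notes
]

def clean_abstract(text: str) -> str:
    """Strip common publisher clutter from an abstract (rule-based sketch)."""
    for pattern in CLUTTER_PATTERNS:
        text = re.sub(pattern, " ", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

raw = ("Background: We study X. Results: X improves Y. "
       "\u00a9 2024 Example Publisher. All rights reserved.")
print(clean_abstract(raw))  # -> We study X. X improves Y.
```

A rule-based approach like this is brittle (clutter phrasing varies widely across publishers), which is precisely the gap a trained model is meant to close.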
Related papers
- Loci Similes: A Benchmark for Extracting Intertextualities in Latin Literature [4.132158161225706]
Loci Similes is a benchmark for Latin intertextuality detection comprising a curated dataset of 172k text segments containing 545 expert-verified parallels linking Late Antique authors to a corpus of classical authors. We establish baselines for retrieval and classification of intertextualities with state-of-the-art LLMs.
arXiv Detail & Related papers (2026-01-12T13:34:49Z)
- Citation Parsing and Analysis with Language Models [0.0]
We investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We find that, even out of the box, today's language models achieve high levels of accuracy in identifying the constituent components of each citation.
arXiv Detail & Related papers (2025-05-21T19:06:17Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Citance-Contextualized Summarization of Scientific Papers [33.85387549129378]
Abstracts are not intended to show the relationship between a paper and the references cited in it.
We propose a new contextualized summarization approach that can generate an informative summary conditioned on a given sentence containing the citation of a reference.
arXiv Detail & Related papers (2023-11-04T14:08:15Z)
- MORTY: Structured Summarization for Targeted Information Extraction from Scholarly Articles [0.0]
We present MORTY, an information extraction technique that creates structured summaries of text from scholarly articles.
Our approach condenses the article's full text into property-value pairs, presented as a segmented text snippet called a structured summary.
We also present a sizable scholarly dataset combining structured summaries retrieved from a scholarly knowledge graph and corresponding publicly available scientific articles.
arXiv Detail & Related papers (2022-12-11T06:49:29Z)
- Scientific Paper Extractive Summarization Enhanced by Citation Graphs [50.19266650000948]
We focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.
Preliminary results demonstrate that the citation graph is helpful even in a simple unsupervised framework.
Motivated by this, we propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available.
arXiv Detail & Related papers (2022-12-08T11:53:12Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- Enhancing Scientific Papers Summarization with Citation Graph [78.65955304229863]
We redefine the task of scientific papers summarization by utilizing their citation graph.
We construct a novel scientific papers summarization dataset Semantic Scholar Network (SSN) which contains 141K research papers in different domains.
Our model can achieve competitive performance when compared with the pretrained models.
arXiv Detail & Related papers (2021-04-07T11:13:35Z)
- What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
- From Standard Summarization to New Tasks and Beyond: Summarization with Manifold Information [77.89755281215079]
Text summarization is the research area aimed at creating a short, condensed version of an original document.
In real-world applications, most data is not in a plain-text format.
This paper surveys these new summarization tasks and approaches in real-world applications.
arXiv Detail & Related papers (2020-05-10T14:59:36Z)
- StructSum: Summarization via Structured Representations [27.890477913486787]
Abstractive text summarization aims at compressing the information of a long source document into a condensed summary.
Despite advances in modeling techniques, abstractive summarization models still suffer from several key challenges.
We propose a framework based on document-level structure induction for summarization to address these challenges.
arXiv Detail & Related papers (2020-03-01T20:32:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.