unarXive 2022: All arXiv Publications Pre-Processed for NLP, Including
Structured Full-Text and Citation Network
- URL: http://arxiv.org/abs/2303.14957v1
- Date: Mon, 27 Mar 2023 07:40:59 GMT
- Authors: Tarek Saier and Johan Krause and Michael Färber
- Abstract summary: We propose a new version of the data set unarXive.
The resulting data set comprises 1.9 M publications spanning multiple disciplines and 32 years.
In addition to the data set, we provide ready-to-use training/test data for citation recommendation and IMRaD classification.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale data sets on scholarly publications are the basis for a variety
of bibliometric analyses and natural language processing (NLP) applications.
Especially data sets derived from publications' full text have recently gained
attention. While several such data sets already exist, we see key shortcomings
in terms of their domain and time coverage, citation network completeness, and
representation of full-text content. To address these points, we propose a new
version of the data set unarXive. We base our data processing pipeline and
output format on two existing data sets, and improve on each of them. Our
resulting data set comprises 1.9 M publications spanning multiple disciplines
and 32 years. It furthermore has a more complete citation network than its
predecessors and retains a richer representation of document structure as well
as non-textual publication content such as mathematical notation. In addition
to the data set, we provide ready-to-use training/test data for citation
recommendation and IMRaD classification. All data and source code are publicly
available at https://github.com/IllDepence/unarXive.
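The data set pairs structured full-text records with a citation network. As a rough illustration of how such a JSONL dump might be consumed, the sketch below counts incoming citations per paper; the field names (`paper_id`, `cited_ids`) are hypothetical, not the actual unarXive schema.

```python
import json
from collections import Counter

# Hypothetical JSONL records: one publication per line, each listing the
# ids of the papers it cites. The real unarXive schema may differ.
sample_jsonl = "\n".join(json.dumps(rec) for rec in [
    {"paper_id": "A", "discipline": "cs",   "cited_ids": ["B", "C"]},
    {"paper_id": "B", "discipline": "math", "cited_ids": ["C"]},
    {"paper_id": "C", "discipline": "cs",   "cited_ids": []},
])

def citation_counts(jsonl_text):
    """Count incoming citations for each paper id."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        rec = json.loads(line)
        for cited in rec["cited_ids"]:
            counts[cited] += 1
    return counts

counts = citation_counts(sample_jsonl)
# "C" is cited by both "A" and "B", so it has two incoming citations.
```

The same per-paper counts could then feed, e.g., a citation-recommendation baseline that prefers frequently cited targets.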
Related papers
- AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing [82.33075210051129]
We introduce AceParse, the first comprehensive dataset designed to support the parsing of structured texts.
Based on AceParse, we fine-tuned a multimodal model, named Ace, which accurately parses various structured texts.
This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity.
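The summary reports gains in F1 and Jaccard similarity but does not spell out the evaluation protocol. For reference, the standard Jaccard similarity on token sets, which parsing evaluations of this kind commonly use, is |A ∩ B| / |A ∪ B|:

```python
def jaccard(pred_tokens, gold_tokens):
    """Jaccard similarity between two token sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(pred_tokens), set(gold_tokens)
    if not a and not b:
        return 1.0  # two empty outputs are identical by convention
    return len(a & b) / len(a | b)

# 2 shared tokens ("b", "c") out of 4 distinct tokens overall -> 0.5
score = jaccard("a b c".split(), "b c d".split())
```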
arXiv Detail & Related papers (2024-09-16T06:06:34Z)
- [Citation needed] Data usage and citation practices in medical imaging conferences [1.9702506447163306]
We present two open-source tools that could help with the detection of dataset usage.
We studied the usage of 20 publicly available medical datasets in papers from MICCAI and MIDL.
Our findings show that usage is concentrated on a limited set of datasets.
arXiv Detail & Related papers (2024-02-05T13:41:22Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means of encoding structured knowledge for knowledge-intensive applications, i.e., open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
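The paper's verbalizer is a trained data-to-text model; the fixed-template sketch below only illustrates the input/output shape of turning a Wikidata-style triple into a plain sentence a retriever can index, and the example triple is invented for illustration.

```python
def verbalize_triple(subj, relation, obj):
    """Turn a (subject, relation, object) triple into a plain sentence.

    A real verbalizer is a learned data-to-text model; this template
    merely shows how structured knowledge becomes retrievable text.
    """
    return f"{subj} {relation.replace('_', ' ')} {obj}."

sentence = verbalize_triple("Marie Curie", "was_born_in", "Warsaw")
# -> "Marie Curie was born in Warsaw."
```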
arXiv Detail & Related papers (2021-10-16T00:11:21Z)
- DocNLI: A Large-scale Dataset for Document-level Natural Language Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
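A document-level NLI instance pairs a whole document (the premise) with a hypothesis sentence and a label. The field names and label strings below are illustrative, not the actual DocNLI format:

```python
# Illustrative document-level NLI instance: the premise is a multi-sentence
# document rather than a single sentence.
example = {
    "premise": ("The city council approved the new budget on Monday. "
                "Funding for public libraries will increase by 10%."),
    "hypothesis": "Library funding is going up.",
    "label": "entailment",
}

def is_entailed(instance):
    """True if the hypothesis is labeled as entailed by the document."""
    return instance["label"] == "entailment"
```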
arXiv Detail & Related papers (2021-06-17T13:02:26Z)
- Documenting the English Colossal Clean Crawled Corpus [28.008953329187648]
This work provides the first documentation for the Colossal Clean Crawled Corpus (C4; Raffel et al., 2020), a dataset created by applying a set of filters to a single snapshot of Common Crawl.
We begin with a high-level summary of the data, including distributions of where the text came from and when it was written.
We then give more detailed analysis on salient parts of this data, including the most frequent sources of text.
arXiv Detail & Related papers (2021-04-18T07:42:52Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models, and the results verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high likelihood of gaining citations from those with a low one.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)
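The entry above combines a visual classifier (document appearance) with a text classifier (content). The paper's actual fusion method is not given here, so the sketch below shows only a generic way to combine two such models: a weighted average of their predicted probabilities with a decision threshold.

```python
def fuse_scores(p_visual, p_text, weight_visual=0.5):
    """Weighted average of two classifiers' 'high impact' probabilities.

    A generic late-fusion ensemble; the paper's actual combination
    strategy may differ.
    """
    return weight_visual * p_visual + (1 - weight_visual) * p_text

def predict_high_impact(p_visual, p_text, threshold=0.5):
    """True if the fused score crosses the decision threshold."""
    return fuse_scores(p_visual, p_text) >= threshold

predict_high_impact(0.9, 0.7)  # both classifiers agree: high impact
```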
This list is automatically generated from the titles and abstracts of the papers in this site.