SMAuC -- The Scientific Multi-Authorship Corpus
- URL: http://arxiv.org/abs/2211.02477v2
- Date: Wed, 10 May 2023 12:21:38 GMT
- Title: SMAuC -- The Scientific Multi-Authorship Corpus
- Authors: Janek Bevendorff, Philipp Sauer, Lukas Gienapp, Wolfgang Kircheis,
Erik Körner, Benno Stein, Martin Potthast
- Abstract summary: We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to scientific authorship analysis.
Comprising over 3 million publications across various disciplines from over 5 million authors, SMAuC is the largest openly accessible corpus for this purpose.
- Score: 32.77279821297011
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapidly growing volume of scientific publications offers an interesting
challenge for research on methods for analyzing the authorship of documents
with one or more authors. However, most existing datasets lack scientific
documents or the necessary metadata for constructing new experiments and test
cases. We introduce SMAuC, a comprehensive, metadata-rich corpus tailored to
scientific authorship analysis. Comprising over 3 million publications across
various disciplines from over 5 million authors, SMAuC is the largest openly
accessible corpus for this purpose. It encompasses scientific texts from
humanities and natural sciences, accompanied by extensive, curated metadata,
including unambiguous author IDs. SMAuC aims to significantly advance the
domain of authorship analysis in scientific texts.
Related papers
- SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions [52.35520385083425]
We present SciDMT, an enhanced and expanded corpus for scientific mention detection.
The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mentions in the form of in-text spans, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes.
arXiv Detail & Related papers (2024-06-20T22:03:21Z)
- A Comprehensive Survey of Scientific Large Language Models and Their Applications in Scientific Discovery [68.48094108571432]
Large language models (LLMs) have revolutionized the way text and other modalities of data are handled.
We aim to provide a more holistic view of the research landscape by unveiling cross-field and cross-modal connections between scientific LLMs.
arXiv Detail & Related papers (2024-06-16T08:03:24Z)
- MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of Scientific Workflows.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z)
- A Survey of Decomposition-Based Evolutionary Multi-Objective Optimization: Part II -- A Data Science Perspective [4.322038460697958]
We build a knowledge graph that encapsulates more than 5,400 papers, 10,000 authors, 400 venues, and 1,600 institutions for MOEA/D research.
We also explore the collaboration and citation networks of MOEA/D, uncovering hidden patterns in the growth of literature.
arXiv Detail & Related papers (2024-04-22T14:38:58Z)
- Uni-SMART: Universal Science Multimodal Analysis and Research Transformer [22.90687836544612]
We present Uni-SMART, an innovative model designed for in-depth understanding of scientific literature.
Uni-SMART demonstrates superior performance over other text-focused LLMs.
Our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts.
arXiv Detail & Related papers (2024-03-15T13:43:47Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Modeling Information Change in Science Communication with Semantically Matched Paraphrases [50.67030449927206]
SPICED is the first paraphrase dataset of scientific findings annotated for degree of information change.
SPICED contains 6,000 scientific finding pairs extracted from news stories, social media discussions, and full texts of original papers.
Models trained on SPICED improve downstream performance on evidence retrieval for fact checking of real-world scientific claims.
arXiv Detail & Related papers (2022-10-24T07:44:38Z)
- Overview of STEM Science as Process, Method, Material, and Data Named Entities [0.0]
We develop and analyze a large-scale structured dataset of STEM articles across 10 different disciplines.
Our analysis is defined over a large-scale corpus comprising 60K abstracts structured as four scientific entities: process, method, material, and data.
The STEM-NER-60k corpus, created in this work, comprises over 1M extracted entities from 60k STEM articles obtained from a major publishing platform.
arXiv Detail & Related papers (2022-05-24T07:35:24Z)
- TDMSci: A Specialized Corpus for Scientific Literature Entity Tagging of Tasks Datasets and Metrics [32.4845534482475]
We present a new corpus that contains domain expert annotations for Task (T), Dataset (D), and Metric (M) entities on 2,000 sentences extracted from NLP papers.
We report experimental results on TDM extraction using a simple data augmentation strategy and apply our tagger to around 30,000 NLP papers from the ACL Anthology.
arXiv Detail & Related papers (2021-01-25T17:54:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.