The Open Review-Based (ORB) dataset: Towards Automatic Assessment of
Scientific Papers and Experiment Proposals in High-Energy Physics
- URL: http://arxiv.org/abs/2312.04576v1
- Date: Wed, 29 Nov 2023 20:52:02 GMT
- Title: The Open Review-Based (ORB) dataset: Towards Automatic Assessment of
Scientific Papers and Experiment Proposals in High-Energy Physics
- Authors: Jaroslaw Szumega, Lamine Bougueroua, Blerina Gkotse, Pierre Jouvelot,
Federico Ravotti
- Abstract summary: We introduce the new comprehensive Open Review-Based dataset (ORB)
It includes a curated list of more than 36,000 scientific papers with their more than 89,000 reviews and final decisions.
This paper presents our data architecture and an overview of the collected data along with relevant statistics.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the Open Science approach becoming important for research, the evolution
towards open scientific-paper reviews is making an impact on the scientific
community. However, there is a lack of publicly available resources for
conducting research activities related to this subject, as only a limited
number of journals and conferences currently allow access to their review
process for interested parties. In this paper, we introduce the new
comprehensive Open Review-Based dataset (ORB); it includes a curated list of
more than 36,000 scientific papers with their more than 89,000 reviews and
final decisions. We gather this information from two sources: the
OpenReview.net and SciPost.org websites. However, given the volatile nature of
this domain, the software infrastructure that we introduce to supplement the
ORB dataset is designed to accommodate additional resources in the future. The
ORB deliverables include (1) Python code (interfaces and implementations) to
translate document data and metadata into a structured and high-level
representation, (2) an ETL process (Extract, Transform, Load) to facilitate the
automatic updates from defined sources and (3) data files representing the
structured data. The paper presents our data architecture and an overview of
the collected data along with relevant statistics. For illustration purposes,
we also discuss preliminary Natural-Language-Processing-based experiments that
aim to predict (1) papers' acceptance based on their textual embeddings, and
(2) grading statistics inferred from embeddings as well. We believe ORB
provides a valuable resource for researchers interested in open science and
review, with our implementation easing the use of this data for further
analysis and experimentation. We plan to update ORB as the field matures as
well as introduce new resources even more fitted to dedicated scientific
domains such as High-Energy Physics.
Related papers
- SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature [80.49349719239584]
We present SciRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following demonstrations for 54 tasks.
SciRIFF is the first dataset focused on extracting and synthesizing information from research literature across a wide range of scientific fields.
arXiv Detail & Related papers (2024-06-10T21:22:08Z) - MASSW: A New Dataset and Benchmark Tasks for AI-Assisted Scientific Workflows [58.56005277371235]
We introduce MASSW, a comprehensive text dataset on Multi-Aspect Summarization of ScientificAspects.
MASSW includes more than 152,000 peer-reviewed publications from 17 leading computer science conferences spanning the past 50 years.
We demonstrate the utility of MASSW through multiple novel machine-learning tasks that can be benchmarked using this new dataset.
arXiv Detail & Related papers (2024-06-10T15:19:09Z) - Enriched BERT Embeddings for Scholarly Publication Classification [0.13654846342364302]
The NSLP 2024 FoRC Task I addresses this challenge organized as a competition.
The goal is to develop a classifier capable of predicting one of 123 predefined classes from the Open Research Knowledge Graph (ORKG) taxonomy of research fields for a given article.
arXiv Detail & Related papers (2024-05-07T09:05:20Z) - Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning [1.8270184406083445]
We explore using large language models (LLM) and prompting strategies to automatically extract dimensions from documents.
Our approach could aid data publishers and practitioners in creating machine-readable documentation.
We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results.
arXiv Detail & Related papers (2024-04-04T10:09:28Z) - All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z) - The SourceData-NLP dataset: integrating curation into scientific
publishing for training large language models [1.0423199374671421]
We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process.
This dataset contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology.
arXiv Detail & Related papers (2023-10-31T13:22:38Z) - The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z) - Citation Trajectory Prediction via Publication Influence Representation
Using Temporal Knowledge Graph [52.07771598974385]
Existing approaches mainly rely on mining temporal and graph data from academic articles.
Our framework is composed of three modules: difference-preserved graph embedding, fine-grained influence representation, and learning-based trajectory calculation.
Experiments are conducted on both the APS academic dataset and our contributed AIPatent dataset.
arXiv Detail & Related papers (2022-10-02T07:43:26Z) - CitationIE: Leveraging the Citation Graph for Scientific Information
Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z) - What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.