Related papers: CoCon: A Data Set on Combined Contextualized Research Artifact Use

URL: http://arxiv.org/abs/2303.15193v1
Date: Mon, 27 Mar 2023 13:29:09 GMT
Title: CoCon: A Data Set on Combined Contextualized Research Artifact Use
Authors: Tarek Saier and Youxiang Dong and Michael Färber
Abstract summary: CoCon is a large scholarly data set reflecting the combined use of research artifacts in academic publications' full-text. Our data set comprises 35 k artifacts (data sets, methods, models, and tasks) and 340 k publications. We formalize a link prediction task for "combined research artifact use prediction" and provide code to facilitate analyses of our data and the development of ML applications on it.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In the wake of information overload in academia, methodologies and systems for search, recommendation, and prediction to aid researchers in identifying relevant research are actively studied and developed. Existing work, however, is limited in terms of granularity, focusing only on the level of papers or a single type of artifact, such as data sets. To enable more holistic analyses and systems dealing with academic publications and their content, we propose CoCon, a large scholarly data set reflecting the combined use of research artifacts, contextualized in academic publications' full-text. Our data set comprises 35 k artifacts (data sets, methods, models, and tasks) and 340 k publications. We additionally formalize a link prediction task for "combined research artifact use prediction" and provide code to facilitate analyses of our data and the development of ML applications on it. All data and code are publicly available at https://github.com/IllDepence/contextgraph.
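
Illustration: the link prediction task mentioned in the abstract can be pictured as scoring pairs of artifacts that have not yet been used together. The following is a minimal sketch under stated assumptions, not the authors' formalization or code: the toy `papers` mapping, the function names, and the choice of a Jaccard neighbor-overlap heuristic are all hypothetical and serve only to show the general shape of such a task.

```python
from itertools import combinations

# Hypothetical toy data (not taken from CoCon): paper id -> artifacts used in it.
papers = {
    "p1": {"BERT", "SQuAD", "question answering"},
    "p2": {"BERT", "GLUE"},
    "p3": {"RoBERTa", "SQuAD", "question answering"},
    "p4": {"RoBERTa", "GLUE"},
}


def build_co_use_graph(papers):
    """Map each artifact to the set of artifacts it is used together with."""
    graph = {}
    for artifacts in papers.values():
        for a, b in combinations(sorted(artifacts), 2):
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def jaccard_score(graph, a, b):
    """Neighbor overlap of two artifacts; a simple baseline heuristic."""
    neighbors_a = graph.get(a, set())
    neighbors_b = graph.get(b, set())
    union = neighbors_a | neighbors_b
    return len(neighbors_a & neighbors_b) / len(union) if union else 0.0


def rank_unlinked_pairs(graph):
    """Score every artifact pair that is not yet linked, best candidates first."""
    candidates = [
        (a, b) for a, b in combinations(sorted(graph), 2) if b not in graph[a]
    ]
    scored = [(jaccard_score(graph, a, b), a, b) for a, b in candidates]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    graph = build_co_use_graph(papers)
    for score, a, b in rank_unlinked_pairs(graph):
        print(f"{score:.2f}  {a} + {b}")
```

In this toy graph, pairs whose neighborhoods overlap most are ranked as the most likely future combinations. A learned model over the actual data set would replace the heuristic; the sketch only fixes the input (a co-use graph) and the output (ranked unlinked pairs).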

Related papers

Open Datasets in Learning Analytics: Trends, Challenges, and Best PRACTICE [0.4666493857924357]
Open datasets play a crucial role in three research domains that intersect data science and education: learning analytics, educational data mining, and artificial intelligence in education. Providing open datasets alongside research papers supports collaboration and trust in research findings. Despite these advantages, the availability of open datasets and associated practices within the learning analytics research communities, especially at their flagship conference venues, remain unclear.
arXiv Detail & Related papers (2026-02-19T12:23:25Z)
Intelligent Scientific Literature Explorer using Machine Learning (ISLE) [0.797970449705065]
This paper presents an integrated system for scientific literature exploration that combines large-scale data acquisition, hybrid retrieval, semantic topic modeling, and heterogeneous knowledge graph construction. The proposed framework contributes a foundation for AI-assisted scientific discovery.
arXiv Detail & Related papers (2025-12-14T16:54:24Z)
CS-PaperSum: A Large-Scale Dataset of AI-Generated Summaries for Scientific Papers [3.929864777332447]
CS-PaperSum is a large-scale dataset of 91,919 papers from 31 top-tier computer science conferences. Our dataset enables automated literature analysis, research trend forecasting, and AI-driven scientific discovery.
arXiv Detail & Related papers (2025-02-27T22:48:35Z)
SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
Decoding MIE: A Novel Dataset Approach Using Topic Extraction and Affiliation Parsing [0.0]
This study introduces a novel dataset derived from the Medical Informatics Europe (MIE) Conference proceedings. We extracted and processed metadata and abstract from 4,606 articles published in the "Studies in Health Technology and Informatics" journal series.
arXiv Detail & Related papers (2024-10-06T19:34:23Z)
Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset. Deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive. Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature. We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
DeepShovel: An Online Collaborative Platform for Data Extraction in Geoscience Literature with AI Assistance [48.55345030503826]
Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data. DeepShovel is a publicly-available AI-assisted data extraction system to support their needs. A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
arXiv Detail & Related papers (2022-02-21T12:18:08Z)
A Survey on Machine Learning Techniques for Source Code Analysis [14.129976741300029]
We aim to summarize the current knowledge in the area of applied machine learning for source code analysis. To do so, we carried out an extensive literature search and identified 364 primary studies published between 2002 and 2021.
arXiv Detail & Related papers (2021-10-18T20:13:38Z)
Deep Learning Schema-based Event Extraction: Literature Review and Current Trends [60.29289298349322]
Event extraction technology based on deep learning has become a research hotspot. This paper fills the gap by reviewing the state-of-the-art approaches, focusing on deep learning-based models.
arXiv Detail & Related papers (2021-07-05T16:32:45Z)
Topic Space Trajectories: A case study on machine learning literature [0.0]
We present topic space trajectories, a structure that allows for the comprehensible tracking of research topics. We show the applicability of our approach on a publication corpus spanning 50 years of machine learning research from 32 publication venues. Our novel analysis method may be employed for paper classification, for the prediction of future research topics, and for the recommendation of fitting conferences and journals for submitting unpublished work.
arXiv Detail & Related papers (2020-10-23T10:53:42Z)
Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high from those with a low likelihood of gaining citations. Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.