DeepShovel: An Online Collaborative Platform for Data Extraction in
Geoscience Literature with AI Assistance
- URL: http://arxiv.org/abs/2202.10163v2
- Date: Thu, 24 Feb 2022 05:39:59 GMT
- Title: DeepShovel: An Online Collaborative Platform for Data Extraction in
Geoscience Literature with AI Assistance
- Authors: Shao Zhang, Yuting Jia, Hui Xu, Ying Wen, Dakuo Wang, Xinbing Wang
- Abstract summary: Geoscientists need to read a huge amount of literature to locate, extract, and aggregate relevant results and data.
DeepShovel is a publicly-available AI-assisted data extraction system to support their needs.
A follow-up user evaluation with 14 researchers suggested DeepShovel improved users' efficiency of data extraction for building scientific databases.
- Score: 48.55345030503826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Geoscientists, as well as researchers in many fields, need to read a huge
amount of literature to locate, extract, and aggregate relevant results and
data to enable future research or to build a scientific database, but there is
no existing system to support this use case well. In this paper, based on the
findings of a formative study about how geoscientists collaboratively annotate
literature and extract and aggregate data, we propose DeepShovel, a
publicly available AI-assisted data extraction system to support their needs.
DeepShovel leverages state-of-the-art neural network models to help
researchers easily and accurately annotate papers (in PDF format) and
extract data from tables, figures, maps, etc. in a human-AI collaborative
manner. A follow-up user evaluation with 14 researchers suggested DeepShovel
improved users' efficiency of data extraction for building scientific
databases and encouraged teams to form larger-scale but more tightly coupled
collaborations.
Related papers
- Pennsieve: A Collaborative Platform for Translational Neuroscience and Beyond [0.5130659559809153]
Pennsieve is an open-source, cloud-based scientific data management platform.
It supports complex multimodal datasets and provides tools for data visualization and analyses.
Pennsieve stores over 125 TB of scientific data, with 35 TB of data publicly available across more than 350 high-impact datasets.
arXiv Detail & Related papers (2024-09-16T17:55:58Z)
- Human-artificial intelligence teaming for scientific information extraction from data-driven additive manufacturing research using large language models [3.0061386772253784]
Data-driven research in Additive Manufacturing (AM) has gained significant success in recent years.
This has led to the emergence of a plethora of scientific literature.
It requires substantial effort and time to extract scientific information from these works.
We propose a framework that enables collaboration between AM and AI experts to continuously extract scientific information from data-driven AM literature.
arXiv Detail & Related papers (2024-07-26T15:43:52Z)
- An Autonomous GIS Agent Framework for Geospatial Data Retrieval [0.0]
This study proposes an autonomous GIS agent framework capable of retrieving required geospatial data.
We developed a prototype agent based on the framework, released as a QGIS plugin (GeoData Retrieve Agent) and a Python program.
Experiment results demonstrate its capability of retrieving data from various sources including OpenStreetMap, administrative boundaries and demographic data from the US Census Bureau.
arXiv Detail & Related papers (2024-07-13T14:23:57Z)
- A Survey on Data Selection for Language Models [148.300726396877]
Data selection methods aim to determine which data points to include in a training dataset.
Deep learning is mostly driven by empirical evidence, and experimentation on large-scale data is expensive.
Few organizations have the resources for extensive data selection research.
arXiv Detail & Related papers (2024-02-26T18:54:35Z)
- GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration [97.68234051078997]
We discuss how Pyserini can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.
We include a Jupyter Notebook-based walkthrough of the core interoperability features, available on GitHub.
We present GAIA Search - a search engine built following previously laid out principles, giving access to four popular large-scale text collections.
arXiv Detail & Related papers (2023-06-02T12:09:59Z)
- Assessing Scientific Contributions in Data Sharing Spaces [64.16762375635842]
This paper introduces the SCIENCE-index, a blockchain-based metric measuring a researcher's scientific contributions.
To incentivize researchers to share their data, the SCIENCE-index is augmented to include a data-sharing parameter.
Our model is evaluated by comparing the distribution of its output for geographically diverse researchers to that of the h-index.
arXiv Detail & Related papers (2023-03-18T19:17:47Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Research Trends and Applications of Data Augmentation Algorithms [77.34726150561087]
We identify the main areas of application of data augmentation algorithms, the types of algorithms used, significant research trends, their progression over time and research gaps in data augmentation literature.
We expect readers to understand the potential of data augmentation, as well as identify future research directions and open questions within data augmentation research.
arXiv Detail & Related papers (2022-07-18T11:38:32Z)
- Librarian-in-the-Loop: A Natural Language Processing Paradigm for Detecting Informal Mentions of Research Data in Academic Literature [1.4190701053683017]
We propose a natural language processing paradigm to support the human task of identifying informal mentions made to research datasets.
The work of discovering informal data mentions is currently performed by librarians and their staff in the Inter-university Consortium for Political and Social Research.
arXiv Detail & Related papers (2022-03-10T02:11:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.