PubSqueezer: A Text-Mining Web Tool to Transform Unstructured Documents
into Structured Data
- URL: http://arxiv.org/abs/2011.03123v2
- Date: Mon, 9 Nov 2020 07:51:35 GMT
- Title: PubSqueezer: A Text-Mining Web Tool to Transform Unstructured Documents
into Structured Data
- Authors: Alberto Calderone
- Abstract summary: I present a web tool which uses a Text Mining strategy to transform unstructured biomedical articles into structured data.
Generated results give a quick overview of complex topics and can suggest information that is not explicitly reported.
I show how a literature-based analysis conducted with PubSqueezer results makes it possible to describe known facts about SARS-CoV-2.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The number of scientific papers published every day is daunting and
constantly increasing. Keeping up with the literature is a challenge. If one
wants to start exploring new topics, it is hard to get the big picture without
reading many articles. Furthermore, as one reads through the literature, making
mental connections is crucial to asking new questions which might lead to
discoveries. In this work, I present a web tool which uses a Text Mining
strategy to transform large collections of unstructured biomedical articles
into structured data. The generated results give a quick overview of complex
topics and can suggest information that is not explicitly reported. In
particular, I show two Data Science analyses. First, I present a
literature-based rare-diseases network built using this tool, in the hope that
it will help clarify some aspects of these less-studied pathologies. Secondly,
I show how a literature-based analysis conducted with PubSqueezer results makes
it possible to describe known facts about SARS-CoV-2. In one sentence, data
generated with PubSqueezer make it easy to use the scientific literature in any
computational analysis, such as machine learning, natural language processing, etc.
Availability: http://www.pubsqueezer.com
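
To make the abstract's claim concrete, below is a minimal sketch of one such
computational analysis: building an entity co-occurrence network from a
structured export of mined articles. The input layout (a CSV with columns
"pmid" and "entity") and the file name pubsqueezer_export.csv are illustrative
assumptions, not the tool's documented output format.

    # Sketch: entity co-occurrence network from a structured text-mining export.
    # The CSV layout and file name below are assumptions for illustration only.
    from collections import defaultdict
    from itertools import combinations
    import csv

    import networkx as nx

    def build_cooccurrence_network(csv_path):
        """Link entities mentioned in the same article; edge weights count shared articles."""
        entities_per_article = defaultdict(set)
        with open(csv_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                entities_per_article[row["pmid"]].add(row["entity"])

        graph = nx.Graph()
        for entities in entities_per_article.values():
            for a, b in combinations(sorted(entities), 2):
                # Increment the edge weight for every article mentioning both entities.
                weight = graph.get_edge_data(a, b, default={"weight": 0})["weight"] + 1
                graph.add_edge(a, b, weight=weight)
        return graph

    if __name__ == "__main__":
        g = build_cooccurrence_network("pubsqueezer_export.csv")  # hypothetical file name
        # Rank entities by weighted degree to spot "hub" terms in the literature.
        hubs = sorted(g.degree(weight="weight"), key=lambda pair: pair[1], reverse=True)
        for entity, degree in hubs[:10]:
            print(f"{entity}\t{degree}")

A weighted co-occurrence graph of this kind is a common starting point for
literature-based network analyses such as the rare-diseases network described above.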
Related papers
- Contri(e)ve: Context + Retrieve for Scholarly Question Answering [0.0]
We present a two-step solution using an open-source Large Language Model (LLM), Llama 3.1, for the Scholarly-QALD dataset.
Firstly, we extract the context pertaining to the question from different structured and unstructured data sources.
Secondly, we implement prompt engineering to improve the information retrieval performance of the LLM.
arXiv Detail & Related papers (2024-09-13T17:38:47Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Interactive Distillation of Large Single-Topic Corpora of Scientific Papers [1.2954493726326113]
A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert handpicking documents.
Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature.
arXiv Detail & Related papers (2023-09-19T17:18:36Z)
- Data-Driven Information Extraction and Enrichment of Molecular Profiling Data for Cancer Cell Lines [1.1999555634662633]
This work presents the design, implementation and application of a novel data extraction and exploration system.
We introduce a new public data exploration portal, which enables automatic linking of genomic copy number variants plots with ranked, related entities.
Our system is publicly available on the web at https://cancercelllines.org.
arXiv Detail & Related papers (2023-07-03T11:15:42Z)
- The Semantic Scholar Open Data Platform [79.4493235243312]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature.
We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction.
The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z)
- Structured information extraction from complex scientific text with fine-tuned large language models [55.96705756327738]
We present a simple sequence-to-sequence approach to joint named entity recognition and relation extraction.
The approach leverages a pre-trained large language model (LLM), GPT-3, that is fine-tuned on approximately 500 pairs of prompts.
This approach represents a simple, accessible, and highly-flexible route to obtaining large databases of structured knowledge extracted from unstructured text.
arXiv Detail & Related papers (2022-12-10T07:51:52Z)
- WikiDes: A Wikipedia-Based Dataset for Generating Short Descriptions from Paragraphs [66.88232442007062]
We introduce WikiDes, a dataset to generate short descriptions of Wikipedia articles.
The dataset consists of over 80k English samples on 6987 topics.
Our paper shows a practical impact on Wikipedia and Wikidata since there are thousands of missing descriptions.
arXiv Detail & Related papers (2022-09-27T01:28:02Z)
- CitationIE: Leveraging the Citation Graph for Scientific Information Extraction [89.33938657493765]
We use the citation graph of referential links between citing and cited papers.
We observe a sizable improvement in end-to-end information extraction over the state-of-the-art.
arXiv Detail & Related papers (2021-06-03T03:00:12Z)
- How to Train Your Agent to Read and Write [52.24605794920856]
Reading and writing research papers is one of the most privileged abilities that a qualified researcher should master.
It would be fascinating if we could train an intelligent agent to help people read and summarize papers, and perhaps even discover and exploit the potential knowledge clues to write novel papers.
We propose a Deep ReAder-Writer (DRAW) network, which consists of a Reader that can extract knowledge graphs (KGs) from input paragraphs and discover potential knowledge, a graph-to-text Writer that generates a novel paragraph, and a ...
arXiv Detail & Related papers (2021-01-04T12:22:04Z)
- Machine Identification of High Impact Research through Text and Image Analysis [0.4737991126491218]
We present a system to automatically separate papers with a high likelihood of gaining citations from those with a low likelihood.
Our system uses both a visual classifier, useful for surmising a document's overall appearance, and a text classifier, for making content-informed decisions.
arXiv Detail & Related papers (2020-05-20T19:12:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.