Logic Mill -- A Knowledge Navigation System
- URL: http://arxiv.org/abs/2301.00200v2
- Date: Fri, 20 Oct 2023 10:00:03 GMT
- Title: Logic Mill -- A Knowledge Navigation System
- Authors: Sebastian Erhardt, Mainak Ghosh, Erik Buunk, Michael E. Rose, Dietmar
Harhoff
- Abstract summary: Logic Mill is a scalable and openly accessible software system that identifies semantically similar documents.
It uses advanced Natural Language Processing (NLP) techniques to generate numerical representations of documents.
The system focuses on scientific publications and patent documents and contains more than 200 million documents.
- Score: 0.16785092703248325
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Logic Mill is a scalable and openly accessible software system that
identifies semantically similar documents within either one domain-specific
corpus or multi-domain corpora. It uses advanced Natural Language Processing
(NLP) techniques to generate numerical representations of documents. Currently
it leverages a large pre-trained language model to generate these document
representations. The system focuses on scientific publications and patent
documents and contains more than 200 million documents. It is easily accessible
via a simple Application Programming Interface (API) or via a web interface.
Moreover, it is continuously being updated and can be extended to text corpora
from other domains. We see this system as a general-purpose tool for future
research applications in the social sciences and other domains.
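The core operation the abstract describes is comparing numerical document representations for semantic similarity, which is typically done with cosine similarity between embedding vectors. Below is a minimal, self-contained sketch of that comparison; the toy 4-dimensional vectors are purely illustrative (real language-model embeddings have hundreds of dimensions and are not the actual Logic Mill representations):

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # dot(u, v) / (|u| * |v|); values near 1.0 mean the
    # documents point in nearly the same semantic direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "document embeddings" (illustrative only).
doc_patent = [0.9, 0.1, 0.3, 0.0]
doc_paper  = [0.8, 0.2, 0.4, 0.1]
doc_other  = [0.0, 0.9, 0.0, 0.8]

# The patent and the paper embed close together; the third
# document does not, so it scores lower.
print(round(cosine_similarity(doc_patent, doc_paper), 3))
```

A production system like the one described would compute such scores at scale over an index of precomputed vectors rather than pairwise in a loop.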
Related papers
- DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems [99.17123445211115]
We introduce DocBench, a benchmark to evaluate large language model (LLM)-based document reading systems.
Our benchmark combines questions written by recruited human annotators with synthetically generated questions.
It includes 229 real documents and 1,102 questions, spanning across five different domains and four major types of questions.
arXiv Detail & Related papers (2024-07-15T13:17:42Z) - DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models [63.466265039007816]
We present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community.
We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
arXiv Detail & Related papers (2024-06-17T15:13:52Z) - Knowledge-Driven Cross-Document Relation Extraction [3.868708275322908]
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task.
We propose a novel approach, KXDocRE, that embeds domain knowledge of entities with input text for cross-document RE.
arXiv Detail & Related papers (2024-05-22T11:30:59Z) - Domain-specific ChatBots for Science using Embeddings [0.5687661359570725]
Large language models (LLMs) have emerged as powerful machine-learning systems capable of handling a myriad of tasks.
Here, we demonstrate how existing methods and software tools can easily be combined to yield a domain-specific chatbot.
arXiv Detail & Related papers (2023-06-15T15:26:20Z) - An Empirical Investigation into the Use of Image Captioning for
Automated Software Documentation [17.47243004709207]
This paper investigates the connection between Graphical User Interfaces and functional, natural language descriptions of software.
We collect, analyze, and open source a large dataset of functional GUI descriptions consisting of 45,998 descriptions for 10,204 screenshots from popular Android applications.
To gain insight into the representational potential of GUIs, we investigate the ability of four Neural Image Captioning models to predict natural language descriptions of varying granularity when provided a screenshot as input.
arXiv Detail & Related papers (2023-01-03T17:15:18Z) - Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z) - Generate rather than Retrieve: Large Language Models are Strong Context
Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
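The generate-then-read pipeline summarized above has two stages: generate a contextual document, then read it to answer. The sketch below shows only the control flow; `call_llm` is a hypothetical stand-in for any language-model API and is stubbed with canned responses so the example runs end to end:

```python
# Sketch of a generate-then-read pipeline. `call_llm` is a
# hypothetical placeholder, stubbed here; a real implementation
# would query an actual large language model.

def call_llm(prompt):
    # Stub responses standing in for real model output.
    if prompt.startswith("Generate a background document"):
        return "Mount Everest is the highest mountain above sea level."
    return "Mount Everest"

def generate_then_read(question):
    # Step 1: prompt the model to *generate* a contextual document,
    # replacing the retriever of a retrieve-then-read system.
    context = call_llm(f"Generate a background document to answer: {question}")
    # Step 2: read the generated document to produce the final answer.
    answer = call_llm(f"Context: {context}\nQuestion: {question}\nAnswer:")
    return context, answer

ctx, ans = generate_then_read("What is the highest mountain?")
print(ans)
```

The design point is that both stages reuse the same model with different prompts, so no external document index is needed.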
arXiv Detail & Related papers (2022-09-21T01:30:59Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - Combining Deep Learning and Reasoning for Address Detection in
Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z) - Open Domain Question Answering over Virtual Documents: A Unified
Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means for encoding structured knowledge for knowledge-intensive applications, i.e. open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z) - Kleister: A novel task for Information Extraction involving Long
Documents with Complex Layout [5.8530995077744645]
We introduce a new task (named Kleister) with two new datasets.
An NLP system must find the most important information about various types of entities in long formal documents.
We propose a Pipeline method as a text-only baseline with different Named Entity Recognition architectures.
arXiv Detail & Related papers (2020-03-04T22:45:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.