PyTorch-IE: Fast and Reproducible Prototyping for Information Extraction
- URL: http://arxiv.org/abs/2406.00007v1
- Date: Thu, 16 May 2024 12:23:37 GMT
- Title: PyTorch-IE: Fast and Reproducible Prototyping for Information Extraction
- Authors: Arne Binder, Leonhard Hennig, Christoph Alt,
- Abstract summary: PyTorch-IE is a framework designed to enable swift, reproducible, and reusable implementations of Information Extraction models.
We propose task modules to decouple the concerns of data representation and model-specific representations.
PyTorch-IE also extends support for widely used libraries such as PyTorch-Lightning for training, HuggingFace datasets for dataset reading, and Hydra for experiment configuration.
- Score: 6.308539010172309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The objective of Information Extraction (IE) is to derive structured representations from unstructured or semi-structured documents. However, developing IE models is complex due to the need of integrating several subtasks. Additionally, representation of data among varied tasks and transforming datasets into task-specific model inputs presents further challenges. To streamline this undertaking for researchers, we introduce PyTorch-IE, a deep-learning-based framework uniquely designed to enable swift, reproducible, and reusable implementations of IE models. PyTorch-IE offers a flexible data model capable of creating complex data structures by integrating interdependent layers of annotations derived from various data types, like plain text or semi-structured text, and even images. We propose task modules to decouple the concerns of data representation and model-specific representations, thereby fostering greater flexibility and reusability of code. PyTorch-IE also extends support for widely used libraries such as PyTorch-Lightning for training, HuggingFace datasets for dataset reading, and Hydra for experiment configuration. Supplementary libraries and GitHub templates for the easy setup of new projects are also provided. By ensuring functionality and versatility, PyTorch-IE provides vital support to the research community engaged in Information Extraction.
Related papers
- A Survey on Deep Tabular Learning [0.0]
Tabular data presents unique challenges for deep learning due to its heterogeneous nature and lack of spatial structure.
This survey reviews the evolution of deep learning models for Tabular data, from early fully connected networks (FCNs) to advanced architectures like TabNet, SAINT, TabTranSELU, and MambaNet.
arXiv Detail & Related papers (2024-10-15T20:08:08Z) - CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z) - PyTorch Frame: A Modular Framework for Multi-Modal Tabular Learning [54.912520425218496]
We present PyTorch Frame, a PyTorch-based framework for deep learning over multi-modal tabular data.
We demonstrate the usefulness of PyTorch Frame by implementing diverse models in a modular way.
We integrate PyTorch Frame with PyTorch Geometric, a PyTorch library for Graph Neural Networks (GNNs), to perform end-to-end learning over relational databases.
arXiv Detail & Related papers (2024-03-31T19:15:09Z) - Mirror: A Universal Framework for Various Information Extraction Tasks [28.43708291298155]
We propose a universal framework for various IE tasks, namely Mirror.
We recast existing IE tasks as a multi-span cyclic graph extraction problem and devise a non-autoregressive graph decoding algorithm.
Our model has decent compatibility and outperforms or reaches competitive performance with SOTA systems.
arXiv Detail & Related papers (2023-11-09T14:58:46Z) - Instruct and Extract: Instruction Tuning for On-Demand Information
Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z) - Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR)
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
arXiv Detail & Related papers (2023-06-12T17:56:01Z) - Easy-to-Hard Learning for Information Extraction [57.827955646831526]
Information extraction systems aim to automatically extract structured information from unstructured texts.
We propose a unified easy-to-hard learning framework consisting of three stages, i.e., the easy stage, the hard stage, and the main stage.
By breaking down the learning process into multiple stages, our framework facilitates the model to acquire general IE task knowledge and improve its generalization ability.
arXiv Detail & Related papers (2023-05-16T06:04:14Z) - Data Cards: Purposeful and Transparent Dataset Documentation for
Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.