A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain
- URL: http://arxiv.org/abs/2411.12752v1
- Date: Wed, 06 Nov 2024 07:54:10 GMT
- Title: A Library Perspective on Supervised Text Processing in Digital Libraries: An Investigation in the Biomedical Domain
- Authors: Hermann Kroll, Pascal Sackhoff, Bill Matthias Thang, Maha Ksouri, Wolf-Tilo Balke
- Abstract summary: We focus on relation extraction and text classification, using eight biomedical benchmarks as a showcase.
We consider trade-offs between accuracy and application costs, dive into training data generation through distant supervision and large language models such as ChatGPT, Llama, and OLMo, and discuss how to design final pipelines.
- Score: 3.9519587827662397
- Abstract: Digital libraries that maintain extensive textual collections may want to further enrich their content for certain downstream applications, e.g., building knowledge graphs, semantic enrichment of documents, or implementing novel access paths. All of these applications require some text processing, either to identify relevant entities, extract semantic relationships between them, or to classify documents into categories. However, implementing reliable, supervised workflows can become quite challenging for a digital library because suitable training data must be crafted and reliable models must be trained. While many works focus on achieving the highest accuracy on some benchmark, we tackle the problem from the perspective of a digital library practitioner. In other words, we also consider trade-offs between accuracy and application costs, dive into training data generation through distant supervision and large language models such as ChatGPT, Llama, and OLMo, and discuss how to design final pipelines. To this end, we focus on relation extraction and text classification, using eight biomedical benchmarks as a showcase.
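To make the training-data-generation idea concrete, the sketch below shows LLM-assisted labeling for biomedical relation extraction. The prompt wording, label set, and model name are illustrative assumptions, not the paper's exact pipeline.
```python
# Minimal sketch: generating relation-extraction training labels with an LLM.
# The prompt, label set, and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["treats", "induces", "interacts_with", "none"]  # hypothetical label set

def label_sentence(sentence: str, head: str, tail: str) -> str:
    """Ask the LLM which relation (if any) holds between two entities."""
    prompt = (
        f"Sentence: {sentence}\n"
        f"Which relation holds between '{head}' and '{tail}'? "
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labels for training data
    )
    answer = response.choices[0].message.content.strip().lower()
    return answer if answer in LABELS else "none"

# Example: label_sentence("Metformin is used to treat type 2 diabetes.",
#                         "Metformin", "type 2 diabetes") -> "treats"
```
Labels obtained this way can then train a cheaper supervised model, which is exactly the accuracy-versus-cost trade-off the paper examines.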
Related papers
- Synthetic continued pretraining [29.6872772403251]
We propose synthetic continued pretraining on a small corpus of domain-specific documents.
We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm.
We show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
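A minimal sketch of an EntiGraph-style augmentation loop follows, assuming hypothetical extract_entities and generate helpers; the paper describes the actual algorithm.
```python
# Minimal sketch of EntiGraph-style augmentation: extract entities, then ask
# an LLM to write new text connecting entity pairs. Helper names and the
# prompt are assumptions, not the paper's exact procedure.
from itertools import combinations

def entigraph_augment(document: str, extract_entities, generate) -> list[str]:
    """Produce synthetic passages that 'rearrange' the document's knowledge.

    extract_entities: callable returning a list of entity strings.
    generate: callable wrapping an LLM completion call.
    """
    entities = extract_entities(document)
    synthetic = []
    for a, b in combinations(entities, 2):
        prompt = (
            f"Based on the following document, discuss how '{a}' relates "
            f"to '{b}'.\n\nDocument:\n{document}"
        )
        synthetic.append(generate(prompt))
    return synthetic  # used as extra data for continued pretraining
```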
arXiv Detail & Related papers (2024-09-11T17:21:59Z)
- iText2KG: Incremental Knowledge Graphs Construction Using Large Language Models [0.7165255458140439]
iText2KG is a method for incremental, topic-independent Knowledge Graph construction without post-processing.
Our method demonstrates superior performance compared to baseline methods across three scenarios.
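The incremental construction idea can be sketched as follows; the exact-match entity resolution here is a deliberate simplification, since iText2KG itself resolves entities with LLMs and embeddings.
```python
# Minimal sketch of incremental KG construction: extract triples per document
# and merge entities into a growing graph. The lowercase-match resolution is
# a simplification of what iText2KG actually does.
class IncrementalKG:
    def __init__(self):
        self.entities: dict[str, str] = {}   # canonical name -> type
        self.triples: set[tuple[str, str, str]] = set()

    def _resolve(self, name: str) -> str:
        key = name.strip().lower()
        self.entities.setdefault(key, "Unknown")
        return key

    def add_document(self, extracted: list[tuple[str, str, str]]) -> None:
        """Merge (head, relation, tail) triples from one document."""
        for head, rel, tail in extracted:
            self.triples.add((self._resolve(head), rel, self._resolve(tail)))

kg = IncrementalKG()
kg.add_document([("Aspirin", "treats", "headache")])
kg.add_document([("aspirin", "interacts_with", "ibuprofen")])  # merges "aspirin"
```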
arXiv Detail & Related papers (2024-09-05T06:49:14Z)
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
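Benchmarks like this typically reduce to ranking a corpus per query and scoring the resulting ranks; a generic recall@k computation (not CoIR's actual harness) might look like:
```python
# Generic recall@k for a retrieval model; this is the standard computation
# such benchmarks report, not CoIR's own evaluation code.
import numpy as np

def recall_at_k(query_vecs, corpus_vecs, relevant_ids, k: int = 10) -> float:
    """query_vecs: (Q, d), corpus_vecs: (N, d), relevant_ids[i] = gold doc index."""
    scores = query_vecs @ corpus_vecs.T         # cosine if vectors are normalized
    top_k = np.argsort(-scores, axis=1)[:, :k]  # indices of the k best docs
    hits = [gold in row for gold, row in zip(relevant_ids, top_k)]
    return float(np.mean(hits))
```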
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
- U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts [9.76730765089929]
U-DIADS-Bib is a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in the fields of computer vision and humanities.
We propose a novel, computer-aided segmentation pipeline to alleviate the burden of time-consuming manual annotation.
arXiv Detail & Related papers (2024-01-16T15:11:18Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
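A rough sketch of the episodic structure such meta-learning frameworks share, using a generic MAML-style inner/outer update with a hypothetical model.loss API; this is not the paper's exact method.
```python
# Sketch of an episodic few-shot step: adapt on the support set, then
# evaluate on the query set for the outer update. The model.loss(...) API
# is hypothetical; only the inner/outer structure is the point here.
import torch

def episode_step(model, support, query, inner_lr=1e-2):
    """One meta-learning episode: adapt on support, evaluate on query."""
    adapted = {n: p.clone() for n, p in model.named_parameters()}
    loss = model.loss(support, params=adapted)          # hypothetical API
    grads = torch.autograd.grad(loss, list(adapted.values()), create_graph=True)
    adapted = {n: p - inner_lr * g
               for (n, p), g in zip(adapted.items(), grads)}
    return model.loss(query, params=adapted)            # backprop for outer update
```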
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Leveraging Contextual Information for Effective Entity Salience Detection [21.30389576465761]
We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches.
We also show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.
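A minimal sketch of the cross-encoder setup, assuming a binary salient/not-salient head on a generic BERT checkpoint; the model choice and label semantics are illustrative assumptions.
```python
# Minimal cross-encoder sketch for entity salience: the entity mention and
# the document are packed into one input and scored jointly.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = not salient, 1 = salient
)

def salience_score(entity: str, document: str) -> float:
    inputs = tokenizer(entity, document, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(salient)
```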
arXiv Detail & Related papers (2023-09-14T19:04:40Z)
- SelfDocSeg: A Self-Supervised Vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a well-known problem in the document research community.
With growing internet connectivity in personal life, an enormous number of documents have become available in the public domain.
We address this challenge using self-supervision, in contrast to the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
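The multi-loss pretraining can be sketched as a weighted sum; the helper names and weights below are placeholders, not UDoc's exact objectives.
```python
# Sketch of combining several self-supervised objectives into one training
# loss, as UDoc does with three losses; helper names and weights are
# placeholders for illustration.
def pretraining_loss(batch, model, w=(1.0, 1.0, 1.0)):
    text_loss = model.masked_language_loss(batch)        # hypothetical helpers
    vision_loss = model.visual_contrastive_loss(batch)
    align_loss = model.vision_text_alignment_loss(batch)
    return w[0] * text_loss + w[1] * vision_loss + w[2] * align_loss
```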
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
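A rough sketch of such joint training, where each module learns from the other's confident predictions; module internals and the co-training rule are assumptions for illustration.
```python
# Sketch of jointly training a text module and a network module with
# exchanged pseudo-labels; module internals are placeholders, not the
# paper's actual architecture.
def joint_training_step(text_module, network_module, docs, graph, optimizer):
    text_logits = text_module(docs)        # class scores from raw text
    net_logits = network_module(graph)     # class scores from the network
    # Each module also learns from the other's predictions.
    loss = (text_module.loss(text_logits, net_logits.detach().argmax(-1))
            + network_module.loss(net_logits, text_logits.detach().argmax(-1)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```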
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
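The released SPECTER checkpoint on the Hugging Face hub can be used as documented: title and abstract joined by the separator token, with CLS pooling.
```python
# Embedding a paper with the released SPECTER model; this mirrors the
# model's documented usage (title [SEP] abstract, CLS-token pooling).
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

paper = {"title": "SPECTER: Document-level Representation Learning",
         "abstract": "We propose citation-informed transformers ..."}
text = paper["title"] + tokenizer.sep_token + paper["abstract"]
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
embedding = model(**inputs).last_hidden_state[:, 0, :]  # CLS token embedding
```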
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
- Learning Contextualized Document Representations for Healthcare Answer Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
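A rough sketch of a dual-encoder branch with hierarchical LSTMs in this spirit; dimensions and pooling are illustrative choices, not the paper's configuration.
```python
# Sketch of one branch of a dual encoder with hierarchical LSTMs: a
# sentence-level BiLSTM encodes tokens, then a document-level LSTM runs
# over the pooled sentence vectors. Sizes are illustrative.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size=30000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.sent_lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.doc_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)

    def forward(self, docs):                 # docs: (batch, n_sents, n_tokens)
        b, s, t = docs.shape
        tokens = self.embed(docs.view(b * s, t))
        sent_out, _ = self.sent_lstm(tokens)
        sent_vecs = sent_out.mean(dim=1).view(b, s, -1)  # pool tokens per sentence
        doc_out, _ = self.doc_lstm(sent_vecs)
        return doc_out                       # one contextual vector per sentence
```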
arXiv Detail & Related papers (2020-02-03T15:47:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.