Related papers: Query-Based Keyphrase Extraction from Long Documents

URL: http://arxiv.org/abs/2205.05391v1
Date: Wed, 11 May 2022 10:29:30 GMT
Title: Query-Based Keyphrase Extraction from Long Documents
Authors: Martin Docekal, Pavel Smrz
Abstract summary: This paper overcomes the input size limits of Transformer-based models for keyphrase extraction by chunking long documents. The system employs a pre-trained BERT model and adapts it to estimate the probability that a given text span forms a keyphrase.
Score: 4.823229052465654
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Transformer-based architectures in natural language processing force input size limits that can be problematic when long documents need to be processed. This paper overcomes this issue for keyphrase extraction by chunking the long documents while keeping a global context as a query defining the topic for which relevant keyphrases should be extracted. The developed system employs a pre-trained BERT model and adapts it to estimate the probability that a given text span forms a keyphrase. We experimented using various context sizes on two popular datasets, Inspec and SemEval, and a large novel dataset. The presented results show that a shorter context with a query overcomes a longer one without the query on long documents.
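
The chunking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `chunk_with_query`, the `[CLS]`/`[SEP]` layout, the 512-token limit, and the overlap (`stride`) are all assumptions made for illustration, and the abstract does not say how the query is formed or whether chunks overlap.

```python
from typing import List


def chunk_with_query(
    query_tokens: List[str],
    doc_tokens: List[str],
    max_len: int = 512,   # assumed model input limit
    stride: int = 128,    # assumed overlap between neighbouring chunks
) -> List[List[str]]:
    """Split doc_tokens into overlapping windows, each prefixed by the query."""
    # Reserve room for [CLS], the query, and two [SEP] markers.
    window = max_len - len(query_tokens) - 3
    if window <= 0:
        raise ValueError("query leaves no room for document tokens")
    if not 0 <= stride < window:
        raise ValueError("stride must be non-negative and smaller than the window")

    step = window - stride
    chunks: List[List[str]] = []
    start = 0
    while True:
        piece = doc_tokens[start:start + window]
        # Hypothetical layout: the query acts as the global context of every chunk.
        chunks.append(["[CLS]", *query_tokens, "[SEP]", *piece, "[SEP]"])
        if start + window >= len(doc_tokens):
            break
        start += step
    return chunks
```

Each returned chunk fits within `max_len` tokens and could then be scored by a span classifier; the scoring model itself is outside the scope of this sketch.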

Related papers

HERA: Improving Long Document Summarization using Large Language Models with Context Packaging and Reordering [6.876612430571396]
We propose a novel summary generation framework, called HERA. We first segment a long document by its semantic structure and retrieve text segments about the same event, and finally reorder them to form the input context. The experimental results show that HERA outperforms foundation models in ROUGE, BERTScore and faithfulness metrics.
arXiv Detail & Related papers (2025-02-01T14:55:06Z)
LongKey: Keyphrase Extraction for Long Documents [3.832358080820378]
LongKey is a novel framework for extracting keyphrases from lengthy documents. LongKey consistently outperforms existing unsupervised and language model-based keyphrase extraction methods.
arXiv Detail & Related papers (2024-11-26T20:26:47Z)
Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model [22.07414287186125]
Quest is a query-centric data synthesis method aggregating semantically relevant yet diverse documents. It uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords. Experiments demonstrate Quest's superior performance on long-context tasks, achieving remarkable results with context lengths of up to 1M tokens.
arXiv Detail & Related papers (2024-05-30T08:50:55Z)
In-context Pretraining: Language Modeling Beyond Document Boundaries [137.53145699439898]
In-Context Pretraining is a new approach where language models are pretrained on a sequence of related documents. We introduce approximate algorithms for finding related documents with efficient nearest neighbor search. We see notable improvements in tasks that require more complex contextual reasoning.
arXiv Detail & Related papers (2023-10-16T17:57:12Z)
PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure. We propose PDFTriage that enables models to retrieve the context based on either structure or content. Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z)
DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
Multi-Document Keyphrase Extraction: A Literature Review and the First Dataset [24.91326715164367]
Multi-document keyphrase extraction has been infrequently studied, despite its utility for describing sets of documents. We present here the first literature review and the first dataset for the task, MK-DUC-01, which can serve as a new benchmark.
arXiv Detail & Related papers (2021-10-03T19:10:28Z)
Text Summarization with Latent Queries [60.468323530248945]
We introduce LaQSum, the first unified text summarization system that learns Latent Queries from documents for abstractive summarization with any existing query forms. Under a deep generative framework, our system jointly optimizes a latent query model and a conditional language model, allowing users to plug-and-play queries of any type at test time. Our system robustly outperforms strong comparison systems across summarization benchmarks with different query types, document settings, and target domains.
arXiv Detail & Related papers (2021-05-31T21:14:58Z)
Open Question Answering over Tables and Text [55.8412170633547]
In open question answering (QA), the answer to a question is produced by retrieving and then analyzing documents that might contain answers to the question. Most open QA systems have considered only retrieving information from unstructured text. We present a new large-scale dataset Open Table-and-Text Question Answering (OTT-QA) to evaluate performance on this task.
arXiv Detail & Related papers (2020-10-20T16:48:14Z)
Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching [28.190001111358438]
We propose a Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We will open source a Wikipedia based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching.
arXiv Detail & Related papers (2020-04-26T07:04:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.