Peek Across: Improving Multi-Document Modeling via Cross-Document
Question-Answering
- URL: http://arxiv.org/abs/2305.15387v1
- Date: Wed, 24 May 2023 17:48:40 GMT
- Title: Peek Across: Improving Multi-Document Modeling via Cross-Document
Question-Answering
- Authors: Avi Caciularu, Matthew E. Peters, Jacob Goldberger, Ido Dagan, Arman
Cohan
- Abstract summary: We pre-train a generic multi-document model with a novel cross-document question-answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
- Score: 49.85790367128085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The integration of multi-document pre-training objectives into language
models has resulted in remarkable improvements in multi-document downstream
tasks. In this work, we propose extending this idea by pre-training a generic
multi-document model from a novel cross-document question answering
pre-training objective. To that end, given a set (or cluster) of
topically-related documents, we systematically generate semantically-oriented
questions from a salient sentence in one document and challenge the model,
during pre-training, to answer these questions while "peeking" into other
topically-related documents. In a similar manner, the model is also challenged
to recover the sentence from which the question was generated, again while
leveraging cross-document information. This novel multi-document QA formulation
directs the model to better recover cross-text informational relations, and
introduces a natural augmentation that artificially increases the pre-training
data. Further, unlike prior multi-document models that focus on either
classification or summarization tasks, our pre-training objective formulation
enables the model to perform tasks that involve both short text generation
(e.g., QA) and long text generation (e.g., summarization). Following this
scheme, we pre-train our model -- termed QAmden -- and evaluate its performance
across several multi-document tasks, including multi-document QA,
summarization, and query-focused summarization, yielding improvements of up to
7%, and significantly outperforming zero-shot GPT-3.5 and GPT-4.
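The objective described above can be pictured as a simple data-construction loop over each document cluster. The sketch below is only an illustration under stated assumptions: the salient-sentence selector, the question-generation step, and the `<doc-sep>` marker are hypothetical placeholders, not the authors' actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical separator token; the markers used by QAmden are not
# specified in the abstract.
DOC_SEP = "<doc-sep>"

@dataclass
class PretrainingInstance:
    source: str   # model input
    target: str   # expected generation

def build_qa_instances(
    cluster: List[str],
    pick_salient_sentence: Callable[[str], str],        # assumed helper
    generate_qa: Callable[[str], Tuple[str, str]],      # assumed QG model
) -> List[PretrainingInstance]:
    """Sketch of the cross-document QA objective: a question is generated
    from a salient sentence in one document, and the model must answer it
    (or recover the sentence) while "peeking" into the other documents."""
    instances = []
    for i, doc in enumerate(cluster):
        sentence = pick_salient_sentence(doc)
        question, answer = generate_qa(sentence)
        peek_docs = [d for j, d in enumerate(cluster) if j != i]
        context = DOC_SEP.join(peek_docs)
        # Objective 1: answer the question using the other documents.
        instances.append(PretrainingInstance(
            source=f"{question} {DOC_SEP} {context}", target=answer))
        # Objective 2: recover the source sentence itself.
        instances.append(PretrainingInstance(
            source=f"{question} {DOC_SEP} {context}", target=sentence))
    return instances
```

Because every salient sentence yields two instances per cluster, this construction also acts as the natural data augmentation mentioned in the abstract.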
Related papers
- On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Retrieval-Generation Synergy Augmented Large Language Models [30.53260173572783]
We propose an iterative retrieval-generation collaborative framework.
We conduct experiments on four question answering datasets, including single-hop QA and multi-hop QA tasks.
arXiv Detail & Related papers (2023-10-08T12:50:57Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer (a minimal sketch of this two-step flow is shown below).
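The following is a minimal sketch of the generate-then-read idea as summarized here, assuming a generic text-completion callable rather than any specific model API; the prompts are illustrative, not the paper's.

```python
from typing import Callable, List

def generate_then_read(
    question: str,
    llm: Callable[[str], str],     # assumed text-completion interface
    num_context_docs: int = 3,
) -> str:
    """Illustrative two-step flow: generate contextual documents for the
    question, then read them to produce the final answer."""
    # Step 1: generate pseudo-documents instead of retrieving real ones.
    contexts: List[str] = [
        llm(f"Generate a background document to answer the question: {question}")
        for _ in range(num_context_docs)
    ]
    # Step 2: read the generated documents and answer.
    joined = "\n\n".join(contexts)
    return llm(
        "Refer to the passages below and answer the question.\n"
        f"Passages:\n{joined}\nQuestion: {question}\nAnswer:"
    )
```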
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Extending Multi-Text Sentence Fusion Resources via Pyramid Annotations [12.394777121890925]
This paper revisits and substantially extends previous dataset creation efforts.
We show that our extended version uses more representative texts for multi-document tasks and provides a larger and more diverse training set.
arXiv Detail & Related papers (2021-10-09T09:15:05Z)
- Sequential Cross-Document Coreference Resolution [14.099694053823765]
Cross-document coreference resolution has become important with the growing interest in multi-document analysis tasks.
We propose a new model that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings.
Our model incrementally composes mentions into cluster representations and predicts links between a mention and the already constructed clusters (an illustrative sketch of this incremental linking follows).
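As a rough illustration of that incremental composition (not the paper's actual architecture, which learns the scoring and cluster representations), the sketch below treats the mention-vs-cluster scorer as an assumed black-box callable.

```python
from typing import Callable, Dict, List

def sequential_coreference(
    mentions: List[Dict],                              # ordered across documents
    score_link: Callable[[Dict, List[Dict]], float],   # assumed learned scorer
    new_cluster_threshold: float = 0.5,
) -> List[List[Dict]]:
    """Illustrative sequential clustering: each mention is either linked to
    the best-scoring existing cluster or starts a new cluster."""
    clusters: List[List[Dict]] = []
    for mention in mentions:
        if clusters:
            scores = [score_link(mention, c) for c in clusters]
            best = max(range(len(clusters)), key=lambda i: scores[i])
            if scores[best] >= new_cluster_threshold:
                clusters[best].append(mention)
                continue
        clusters.append([mention])   # no good match: open a new cluster
    return clusters
```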
arXiv Detail & Related papers (2021-04-17T00:46:57Z)
- Cross-Document Language Modeling [28.34202232940097]
Cross-document language model (CD-LM) improves masked language modeling for multi-document NLP tasks.
We show that our CD-LM sets new state-of-the-art results for several multi-text tasks.
arXiv Detail & Related papers (2021-01-02T09:01:39Z)
- WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization [16.048329028104643]
In the Query Focused Multi-Document Summarization (QF-MDS) task, a set of documents and a query are given where the goal is to generate a summary from these documents.
One major challenge for this task is the lack of availability of labeled training datasets.
We propose a novel weakly supervised learning approach that leverages distant supervision.
arXiv Detail & Related papers (2020-11-03T02:02:55Z)
- Pre-training via Paraphrasing [96.79972492585112]
We introduce MARGE, a pre-trained sequence-to-sequence model learned with an unsupervised multi-lingual paraphrasing objective.
We show it is possible to jointly learn to do retrieval and reconstruction, given only a random initialization.
For example, with no additional task-specific training we achieve BLEU scores of up to 35.8 for document translation.
arXiv Detail & Related papers (2020-06-26T14:43:43Z)
- Pre-training for Abstractive Document Summarization by Reinstating Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.