Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical
Encoder for Long-Form Document Matching
- URL: http://arxiv.org/abs/2004.12297v2
- Date: Tue, 13 Oct 2020 01:48:52 GMT
- Title: Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical
Encoder for Long-Form Document Matching
- Authors: Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, Marc Najork
- Abstract summary: We propose the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching.
Our model contains several innovations to adapt self-attention models for longer text input.
We will open source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
- Score: 28.190001111358438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many natural language processing and information retrieval problems can be
formalized as the task of semantic matching. Existing work in this area has
been largely focused on matching between short texts (e.g., question
answering), or between a short and a long text (e.g., ad-hoc retrieval).
Semantic matching between long-form documents, which has many important
applications like news recommendation, related article recommendation and
document clustering, is relatively less explored and needs more research
effort. In recent years, self-attention based models like Transformers and BERT
have achieved state-of-the-art performance in the task of text matching. These
models, however, are still limited to short text like a few sentences or one
paragraph due to the quadratic computational complexity of self-attention with
respect to input text length. In this paper, we address the issue by proposing
the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for
long-form document matching. Our model contains several innovations to adapt
self-attention models for longer text input. In order to better capture
sentence level semantic relations within a document, we pre-train the model
with a novel masked sentence block language modeling task in addition to the
masked word language modeling task used by BERT. Our experimental results on
several benchmark datasets for long-form document matching show that our
proposed SMITH model outperforms the previous state-of-the-art models including
hierarchical attention, multi-depth attention-based hierarchical recurrent
neural network, and BERT. Compared to BERT-based baselines, our model is able
to increase the maximum input text length from 512 to 2048 tokens. We will open
source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to
accelerate future research on long-form document matching.
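To make the hierarchical, Siamese structure described in the abstract concrete, here is a minimal sketch built from standard PyTorch modules. It illustrates the two-level idea only: a sentence-block encoder runs over short, fixed-length blocks, a document-level encoder then attends over the block embeddings, and the same (shared) encoder scores a document pair by cosine similarity. It is not the authors' released implementation; the block length, first-token pooling, mean pooling at the document level, and all hyperparameters are assumptions for the example, and positional encodings are omitted for brevity.

```python
# Minimal sketch (not the authors' released code) of a two-level Siamese
# hierarchical encoder: sentence blocks are encoded independently, their
# embeddings are contextualized by a document-level Transformer, and the
# two document embeddings are compared with cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalEncoder(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, nhead=4,
                 sent_layers=2, doc_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        sent_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        doc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, sent_layers)
        self.doc_encoder = nn.TransformerEncoder(doc_layer, doc_layers)

    def forward(self, token_ids):
        # token_ids: (batch, num_blocks, block_len). The document is split into
        # fixed-length sentence blocks so each self-attention call stays short
        # even when the full document has thousands of tokens.
        b, n, l = token_ids.shape
        x = self.tok_emb(token_ids).view(b * n, l, -1)
        x = self.sent_encoder(x)                  # sentence-level attention
        block_emb = x[:, 0, :].view(b, n, -1)     # first token as block summary
        doc = self.doc_encoder(block_emb)         # document-level attention
        return doc.mean(dim=1)                    # pooled document embedding


def siamese_score(encoder, doc_a, doc_b):
    """Shared (Siamese) encoder weights; similarity score for a document pair."""
    return F.cosine_similarity(encoder(doc_a), encoder(doc_b), dim=-1)


if __name__ == "__main__":
    enc = HierarchicalEncoder()
    a = torch.randint(0, 30522, (2, 64, 32))   # 64 blocks * 32 tokens = 2048
    b = torch.randint(0, 30522, (2, 64, 32))
    print(siamese_score(enc, a, b))             # one score per document pair
```

With 64 blocks of 32 tokens, a forward pass covers 2048 tokens while each self-attention call only ever sees a 32-token or 64-element sequence, which is where the efficiency gain over a flat 2048-token Transformer comes from.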
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model [22.07414287186125]
Quest is a query-centric data synthesis method that aggregates semantically relevant yet diverse documents.
It uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords.
Experiments demonstrate Quest's superior performance on long-context tasks, achieving remarkable results with context lengths of up to 1M tokens.
arXiv Detail & Related papers (2024-05-30T08:50:55Z) - LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z) - JOIST: A Joint Speech and Text Streaming Model For ASR [63.15848310748753]
We present JOIST, an algorithm to train a streaming, cascaded-encoder end-to-end (E2E) model with both paired speech-text inputs and unpaired text-only inputs.
We find that the best text representation for JOIST improves WER across a variety of search and rare-word test sets by 4-14% relative, compared to a model not trained with text.
arXiv Detail & Related papers (2022-10-13T20:59:22Z) - Adapting Pretrained Text-to-Text Models for Long Text Sequences [39.62224414485055]
We adapt an existing pretrained text-to-text model for long-sequence inputs.
We build a long-context model that achieves competitive performance on long-text QA tasks.
arXiv Detail & Related papers (2022-09-21T00:41:07Z) - Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z) - Hierarchical Neural Network Approaches for Long Document Classification [3.6700088931938835]
We employ pre-trained Universal Sentence Encoder (USE) and Bidirectional Encoder Representations from Transformers (BERT) in a hierarchical setup to capture better representations efficiently.
Our proposed models are conceptually simple: we divide the input data into chunks and then pass them through the base BERT and USE models (see the sketch after this list).
We show that USE + CNN/LSTM performs better than its stand-alone baseline, whereas BERT + CNN/LSTM performs on par with its stand-alone counterpart.
arXiv Detail & Related papers (2022-01-18T07:17:40Z) - HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text
Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z) - Cross-Document Language Modeling [28.34202232940097]
Our cross-document language model (CD-LM) improves masked language modeling for multi-document NLP tasks.
We show that our CD-LM sets new state-of-the-art results for several multi-text tasks.
arXiv Detail & Related papers (2021-01-02T09:01:39Z) - ERNIE-DOC: The Retrospective Long-Document Modeling Transformer [24.426571160930635]
We propose ERNIE-DOC, a document-level language pretraining model based on Recurrence Transformers.
Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, enable ERNIE-DOC to use a much longer effective context length.
Various experiments on both English and Chinese document-level tasks are conducted.
arXiv Detail & Related papers (2020-12-31T16:12:48Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, written in the same style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
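As a companion to the "Hierarchical Neural Network Approaches for Long Document Classification" entry above, the following is a minimal sketch of the chunk-then-aggregate setup it describes. The stand-in chunk encoder, chunk size, LSTM aggregator, and class count are illustrative assumptions for the example; the paper itself uses pretrained BERT or USE as the base encoder.

```python
# Minimal sketch (assumptions, not the paper's code) of chunk-then-aggregate
# long-document classification: the document is split into chunks, each chunk
# is embedded by a base encoder (BERT or USE in the paper; a small stand-in
# module here), and an LSTM aggregates the chunk embeddings for classification.
import torch
import torch.nn as nn


class ChunkEncoderStub(nn.Module):
    """Placeholder for a pretrained BERT/USE encoder: mean-pooled embeddings."""
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)

    def forward(self, chunk_ids):               # (batch, num_chunks, chunk_len)
        return self.emb(chunk_ids).mean(dim=2)  # (batch, num_chunks, dim)


class HierarchicalClassifier(nn.Module):
    def __init__(self, num_classes=2, dim=128, hidden=128):
        super().__init__()
        self.chunk_encoder = ChunkEncoderStub(dim=dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, chunk_ids):
        chunk_vecs = self.chunk_encoder(chunk_ids)   # one vector per chunk
        _, (h_n, _) = self.lstm(chunk_vecs)          # aggregate over chunks
        return self.head(h_n[-1])                    # document-level logits


if __name__ == "__main__":
    model = HierarchicalClassifier()
    doc = torch.randint(0, 30522, (4, 16, 64))  # 4 docs, 16 chunks of 64 tokens
    print(model(doc).shape)                     # torch.Size([4, 2])
```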
This list is automatically generated from the titles and abstracts of the papers on this site.