Long Document Ranking with Query-Directed Sparse Transformer
- URL: http://arxiv.org/abs/2010.12683v1
- Date: Fri, 23 Oct 2020 21:57:56 GMT
- Title: Long Document Ranking with Query-Directed Sparse Transformer
- Authors: Jyun-Yu Jiang, Chenyan Xiong, Chia-Jung Lee and Wei Wang
- Abstract summary: We design Query-Directed Sparse attention that induces IR-axiomatic structures in transformer self-attention.
Our model, QDS-Transformer, enforces the principal properties desired in ranking.
Experiments on one fully supervised and three few-shot TREC document ranking benchmarks demonstrate the consistent and robust advantage of QDS-Transformer.
- Score: 30.997237454078526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The computing cost of transformer self-attention often necessitates breaking up
long documents so that they fit into pretrained models for document ranking tasks. In this
paper, we design Query-Directed Sparse attention that induces IR-axiomatic
structures in transformer self-attention. Our model, QDS-Transformer, enforces
the principal properties desired in ranking: local contextualization,
hierarchical representation, and query-oriented proximity matching, while it
also enjoys efficiency from sparsity. Experiments on one fully supervised and
three few-shot TREC document ranking benchmarks demonstrate the consistent and
robust advantage of QDS-Transformer over previous approaches, which either
retrofit long documents into BERT or use sparse attention without emphasizing
IR principles. We further quantify the computing complexity and demonstrate
that our sparse attention with its TVM implementation is twice as efficient as
fully-connected self-attention. All source code, trained models, and
predictions of this work are available at
https://github.com/hallogameboy/QDS-Transformer.
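For intuition, here is a minimal sketch of the sparsity pattern the abstract describes: query tokens attend globally, sentence-level marker tokens act as hierarchical hubs, and ordinary document tokens attend only within a local window, to the sentence markers, and to the query. The token-role encoding and the names build_qds_mask, QUERY, SENT, and DOC are illustrative assumptions; the paper realizes this connectivity with sparse TVM kernels rather than a dense 0/1 mask.

```python
import numpy as np

# Illustrative token roles (an assumption for this sketch, not the paper's encoding).
QUERY, SENT, DOC = 0, 1, 2

def build_qds_mask(roles, window=2):
    """Build a 0/1 attention mask in the spirit of query-directed sparsity.

    roles  : sequence of token roles (QUERY / SENT / DOC).
    window : half-width of the local attention window.
    mask[i, j] = 1 means token i may attend to token j.
    """
    roles = np.asarray(roles)
    n = len(roles)
    mask = np.zeros((n, n), dtype=np.int8)
    is_query = roles == QUERY
    is_sent = roles == SENT

    for i in range(n):
        # Local contextualization: a small window around each token.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = 1
        # Query-oriented proximity matching: every token sees the query,
        # and query tokens see the whole sequence.
        mask[i, is_query] = 1
        if is_query[i]:
            mask[i, :] = 1
        # Hierarchical representation: sentence-level tokens act as global hubs.
        mask[i, is_sent] = 1
        if is_sent[i]:
            mask[i, :] = 1
    return mask

# Toy example: a 2-token query followed by two sentences, each led by a SENT token.
roles = [QUERY, QUERY, SENT, DOC, DOC, DOC, SENT, DOC, DOC]
print(build_qds_mask(roles))
```

Because most rows of the mask are sparse, attention can be computed in blocks that skip the zero entries, which is what the TVM implementation mentioned above exploits.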
Related papers
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time-series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- Long-Range Transformer Architectures for Document Understanding [1.9331361036118608]
Document Understanding (DU) was not left behind, with the first Transformer-based models for DU dating from late 2019.
We introduce two new multi-modal (text + layout) long-range models for DU based on efficient Transformer implementations for long sequences.
Relative 2D attention proved effective on dense text for both normal and long-range models.
arXiv Detail & Related papers (2023-09-11T14:45:24Z)
- Robust representations of oil wells' intervals via sparse attention mechanism [2.604557228169423]
We introduce a class of efficient Transformers named Regularized Transformers (Reguformers).
Our experiments focus on oil & gas data, namely well logs.
To evaluate our models for such problems, we work with an industry-scale open dataset consisting of well logs of more than 20 wells.
arXiv Detail & Related papers (2022-12-29T09:56:33Z)
- Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer [80.50327229467993]
We show that a single model trained end-to-end can achieve both competitive retrieval and QA performance.
We show that end-to-end adaptation significantly boosts its performance on out-of-domain datasets in both supervised and unsupervised settings.
arXiv Detail & Related papers (2022-12-05T04:51:21Z)
- Resource-Efficient Separation Transformer [14.666016177212837]
This paper explores Transformer-based speech separation with a reduced computational cost.
Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture.
The RE-SepFormer achieves competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings.
arXiv Detail & Related papers (2022-06-19T23:37:24Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models on two aspects.
Our framework assumes a hierarchical latent structure of a document, where the top level captures long-range dependencies.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Improving Transformer-Kernel Ranking Model Using Conformer and Query Term Independence [29.442579683405913]
The Transformer-Kernel (TK) model has demonstrated strong reranking performance on the TREC Deep Learning benchmark.
A variant of the TK model, called TKL, incorporates local self-attention to efficiently process longer input sequences.
In this work, we propose a novel Conformer layer as an alternative approach to scale TK to longer input sequences.
arXiv Detail & Related papers (2021-04-19T15:32:34Z)
- Random Feature Attention [69.4671822971207]
We propose RFA, a linear-time and linear-space attention mechanism that uses random feature methods to approximate the softmax function (a brief illustrative sketch of this kernel trick appears after this list).
RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism.
Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines.
arXiv Detail & Related papers (2021-03-03T02:48:56Z)
- Conformer-Kernel with Query Term Independence for Document Retrieval [32.36908635150144]
The Transformer-Kernel (TK) model has demonstrated strong reranking performance on the TREC Deep Learning benchmark.
We extend the TK architecture to the full retrieval setting by incorporating the query term independence assumption.
We show that the Conformer's GPU memory requirement scales linearly with input sequence length, making it a more viable option when ranking long documents.
arXiv Detail & Related papers (2020-07-20T19:47:28Z)
- Pre-training Tasks for Embedding-based Large-scale Retrieval [68.01167604281578]
We consider the large-scale query-document retrieval problem.
Given a query (e.g., a question), return the set of relevant documents from a large document corpus.
We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks.
arXiv Detail & Related papers (2020-02-10T16:44:00Z)
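As a companion to the Random Feature Attention entry above, the sketch below shows the kernel trick that makes attention linear in sequence length: queries and keys are passed through a random feature map so that the attention weights factorize, letting the key/value summary be computed once and reused for every query. The positive exponential feature map and the name rfa_style_attention are simplifying assumptions for this illustration; RFA itself uses trigonometric random features and an optional gating mechanism, both omitted here.

```python
import numpy as np

def positive_random_features(x, W):
    """Positive random features for the exponential (softmax) kernel.

    x : (n, d) scaled queries or keys; W : (d, r) with columns drawn from N(0, I).
    Returns phi(x) such that E[phi(q) . phi(k)] = exp(q . k).
    """
    return np.exp(x @ W - 0.5 * np.sum(x ** 2, axis=-1, keepdims=True))

def rfa_style_attention(Q, K, V, num_features=256, seed=0):
    """Linear-time approximation of softmax attention via random features."""
    d = Q.shape[-1]
    W = np.random.default_rng(seed).normal(size=(d, num_features))
    # Scale so that phi(q) . phi(k) approximates exp(q . k / sqrt(d)).
    phi_q = positive_random_features(Q / d ** 0.25, W)   # (n, r)
    phi_k = positive_random_features(K / d ** 0.25, W)   # (m, r)
    # Summarize keys/values once, then reuse for every query:
    # cost is O((n + m) * r) instead of O(n * m).
    kv = phi_k.T @ V                                      # (r, d_v)
    z = phi_k.sum(axis=0)                                 # (r,)
    return (phi_q @ kv) / (phi_q @ z)[:, None]

# Toy comparison against exact softmax attention.
rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(4, 16)), rng.normal(size=(6, 16)), rng.normal(size=(6, 3))
logits = Q @ K.T / 16 ** 0.5
exact = (np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)) @ V
print(np.abs(exact - rfa_style_attention(Q, K, V)).max())  # Monte Carlo approximation error
```

Because the key/value summaries kv and z are plain sums, they can also be maintained recurrently over a stream of tokens, which is where RFA's optional gating for recency bias attaches.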