Utilizing BERT for Information Retrieval: Survey, Applications,
Resources, and Challenges
- URL: http://arxiv.org/abs/2403.00784v1
- Date: Sun, 18 Feb 2024 23:22:40 GMT
- Title: Utilizing BERT for Information Retrieval: Survey, Applications,
Resources, and Challenges
- Authors: Jiajia Wang, Jimmy X. Huang, Xinhui Tu, Junmei Wang, Angela J. Huang,
Md Tahmid Rahman Laskar, Amran Bhuiyan
- Abstract summary: This survey focuses on approaches that apply pretrained transformer encoders like BERT to information retrieval (IR).
We group them into six high-level categories: (i) handling long documents, (ii) integrating semantic information, (iii) balancing effectiveness and efficiency, (iv) predicting the weights of terms, (v) query expansion, and (vi) document expansion.
We find that, for specific tasks, fine-tuned BERT encoders still outperform generative LLMs, and at a lower deployment cost.
- Score: 4.588192657854766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have witnessed a substantial increase in the use of deep
learning to solve various natural language processing (NLP) problems. Early
deep learning models were constrained by their sequential or unidirectional
nature, such that they struggled to capture the contextual relationships across
text inputs. The introduction of bidirectional encoder representations from
transformers (BERT) led to a robust transformer encoder that
can understand the broader context and deliver state-of-the-art performance
across various NLP tasks. This has inspired researchers and practitioners to
apply BERT to practical problems, such as information retrieval (IR). A survey
that focuses on a comprehensive analysis of prevalent approaches that apply
pretrained transformer encoders like BERT to IR can thus be useful for academia
and industry. In light of this, we revisit a variety of BERT-based methods
in this survey, cover a wide range of techniques of IR, and group them into six
high-level categories: (i) handling long documents, (ii) integrating semantic
information, (iii) balancing effectiveness and efficiency, (iv) predicting the
weights of terms, (v) query expansion, and (vi) document expansion. We also
provide links to resources, including datasets and toolkits, for BERT-based IR
systems. A key highlight of our survey is the comparison between BERT's
encoder-based models and the latest generative Large Language Models (LLMs),
such as ChatGPT, which rely on decoders. Despite the popularity of LLMs, we
find that, for specific tasks, fine-tuned BERT encoders still outperform them, and
at a lower deployment cost. Finally, we summarize the comprehensive outcomes of
the survey and suggest directions for future research in the area.
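As a concrete illustration of how a pretrained encoder such as BERT is typically applied to IR, the sketch below embeds queries and documents independently with a bi-encoder and ranks by cosine similarity. This is a minimal sketch assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; it is not the pipeline of any particular paper covered by the survey.

```python
# Minimal sketch (not from the survey itself): dense retrieval with a
# pretrained BERT encoder in a bi-encoder setup. Assumes the Hugging Face
# `transformers` library and the public `bert-base-uncased` checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(texts):
    """Encode a list of texts into L2-normalized [CLS] vectors."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=256, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] embedding
    return torch.nn.functional.normalize(cls, dim=-1)

query = "how does bert improve document ranking"
docs = [
    "BERT-based rerankers score query-passage pairs with a cross-encoder.",
    "BM25 ranks documents by term frequency and inverse document frequency.",
]

# Queries and documents are embedded independently; ranking is by
# cosine similarity (dot product of normalized vectors).
q_vec, d_vecs = embed([query]), embed(docs)
scores = (q_vec @ d_vecs.T).squeeze(0)
for rank, idx in enumerate(scores.argsort(descending=True).tolist(), start=1):
    print(f"{rank}. score={scores[idx].item():.3f}  {docs[idx]}")
```

The six categories above can be read as responses to the limitations of this naive setup: for example, BERT's 512-token input limit motivates the long-document strategies in (i), and the cost of encoding and storing a vector for every document motivates the effectiveness/efficiency trade-offs in (iii).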
Related papers
- Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report [3.4632900249241874]
This paper presents an experience report on the development of Retrieval Augmented Generation (RAG) systems using PDF documents as the primary data source.
The RAG architecture combines the generative capabilities of Large Language Models (LLMs) with the precision of information retrieval.
The practical implications of this research lie in enhancing the reliability of generative AI systems in various sectors.
arXiv Detail & Related papers (2024-10-21T12:21:49Z)
- A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks [60.38369406877899]
The transformer is a deep neural network that employs a self-attention mechanism to capture contextual relationships within sequential data.
Transformer models excel at handling long-range dependencies between input sequence elements and enable parallel processing.
Our survey encompasses the identification of the top five application domains for transformer-based models.
arXiv Detail & Related papers (2023-06-11T23:13:51Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z)
- BLISS: Robust Sequence-to-Sequence Learning via Self-Supervised Input Representation [92.75908003533736]
We propose a framework-level robust sequence-to-sequence learning approach, named BLISS, via self-supervised input representation.
We conduct comprehensive experiments to validate the effectiveness of BLISS on various tasks, including machine translation, grammatical error correction, and text summarization.
arXiv Detail & Related papers (2022-04-16T16:19:47Z)
- Diagnosing BERT with Retrieval Heuristics [8.299945169799793]
"vanilla BERT" has been shown to outperform existing retrieval algorithms by a wide margin.
In this paper, we employ the recently proposed axiomatic dataset analysis technique.
We find that BERT, when applied to a recently released large-scale web corpus with ad-hoc topics, does not adhere to any of the explored axioms.
arXiv Detail & Related papers (2022-01-12T13:11:17Z)
- Transferring BERT-like Transformers' Knowledge for Authorship Verification [8.443350618722562]
We study the effectiveness of several BERT-like transformers for the task of authorship verification.
We provide new splits for PAN-2020, where training and test data are sampled from disjoint topics or authors.
We show that those splits can enhance the models' capability to transfer knowledge over a new, significantly different dataset.
arXiv Detail & Related papers (2021-12-09T18:57:29Z)
- Pretrained Transformers for Text Ranking: BERT and Beyond [53.83210899683987]
This survey provides an overview of text ranking with neural network architectures known as transformers.
The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing; a minimal cross-encoder reranking sketch in this style appears after this list.
arXiv Detail & Related papers (2020-10-13T15:20:32Z)
- ENT-DESC: Entity Description Generation by Exploring Knowledge Graph [53.03778194567752]
In practice, the input knowledge could be more than enough, since the output description may only cover the most significant knowledge.
We introduce a large-scale and challenging dataset to facilitate the study of such a practical scenario in KG-to-text.
We propose a multi-graph structure that is able to represent the original graph information more comprehensively.
arXiv Detail & Related papers (2020-04-30T14:16:19Z)
- What BERT Sees: Cross-Modal Transfer for Visual Question Generation [21.640299110619384]
We study the visual capabilities of BERT out of the box, without pre-training on supplementary data.
We introduce BERT-gen, a BERT-based architecture for text generation that can leverage either mono- or multi-modal representations.
arXiv Detail & Related papers (2020-02-25T12:44:36Z)
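As noted in the text-ranking entry above, the other common pattern is cross-encoder reranking, in which the query and a candidate passage are encoded jointly and a classification head outputs a relevance score. The sketch below is a minimal illustration under assumed dependencies (Hugging Face transformers and a publicly available fine-tuned cross-encoder checkpoint); it is not the implementation of any surveyed system.

```python
# Minimal sketch (not from the survey itself): monoBERT-style cross-encoder
# reranking. Assumes Hugging Face `transformers` and a BERT-style cross-encoder
# already fine-tuned for relevance; the checkpoint name below is an assumption,
# and any fine-tuned relevance cross-encoder can be substituted.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # assumed public checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt)
model.eval()

query = "utilizing bert for information retrieval"
candidates = [
    "BERT rerankers jointly encode the query and passage and emit a relevance score.",
    "The weather in Toronto is expected to be sunny this weekend.",
]

# Each (query, passage) pair is encoded jointly; the classification head
# produces one relevance logit per pair.
batch = tokenizer([query] * len(candidates), candidates,
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**batch).logits.squeeze(-1)

for idx in scores.argsort(descending=True).tolist():
    print(f"score={scores[idx].item():.3f}  {candidates[idx]}")
```

In practice, a cross-encoder of this kind is applied only to the top candidates returned by a cheaper first stage such as BM25 or a bi-encoder, which is one of the effectiveness/efficiency trade-offs the survey groups under category (iii).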
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.