Diagnosing BERT with Retrieval Heuristics
- URL: http://arxiv.org/abs/2201.04458v1
- Date: Wed, 12 Jan 2022 13:11:17 GMT
- Title: Diagnosing BERT with Retrieval Heuristics
- Authors: Arthur Câmara, Claudia Hauff
- Abstract summary: "vanilla BERT" has been shown to outperform existing retrieval algorithms by a wide margin.
In this paper, we employ the recently proposed axiomatic dataset analysis technique.
We find BERT, when applied to a recently released large-scale web corpus with ad-hoc topics, to not adhere to any of the explored axioms.
- Score: 8.299945169799793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Word embeddings, made widely popular in 2013 with the release of word2vec,
have become a mainstay of NLP engineering pipelines. Recently, with the release
of BERT, word embeddings have moved from the term-based embedding space to the
contextual embedding space -- each term is no longer represented by a single
low-dimensional vector but instead each term and its context determine
the vector weights. BERT's setup and architecture have been shown to be general
enough to be applicable to many natural language tasks. Importantly for
Information Retrieval (IR), in contrast to prior deep learning solutions to IR
problems which required significant tuning of neural net architectures and
training regimes, "vanilla BERT" has been shown to outperform existing
retrieval algorithms by a wide margin, including on tasks and corpora that have
long resisted retrieval effectiveness gains over traditional IR baselines (such
as Robust04). In this paper, we employ the recently proposed axiomatic dataset
analysis technique -- that is, we create diagnostic datasets that each fulfil a
retrieval heuristic (both term matching and semantic-based) -- to explore what
BERT is able to learn. In contrast to our expectations, we find BERT, when
applied to a recently released large-scale web corpus with ad-hoc topics, to
not adhere to any of the explored axioms. At the same time, BERT
outperforms the traditional query likelihood retrieval model by 40%. This
means that the axiomatic approach to IR (and its extension of diagnostic
datasets created for retrieval heuristics) may in its current form not be
applicable to large-scale corpora. Additional -- different -- axioms are
needed.
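To make the axiomatic diagnostic idea concrete, below is a minimal sketch (not the authors' code) of checking one classic term-matching heuristic, TFC1: between two documents of equal length, the one containing the query terms more often should not receive a lower score. The Dirichlet-smoothed query-likelihood scorer and the toy documents are illustrative assumptions standing in for an actual retrieval model and diagnostic dataset.

```python
import math
from collections import Counter

def ql_score(query, doc, collection, mu=2000):
    """Dirichlet-smoothed query likelihood (a stand-in retrieval model)."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    coll_len = len(collection)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len if coll_len else 0.0
        # Smoothed probability of the term under the document language model.
        p_doc = (doc_tf[term] + mu * p_coll) / (len(doc) + mu)
        if p_doc > 0:
            score += math.log(p_doc)
    return score

def satisfies_tfc1(scorer, query, doc_a, doc_b):
    """TFC1: for equal-length documents, the one with more query-term
    occurrences should not score lower."""
    assert len(doc_a) == len(doc_b), "TFC1 compares equal-length documents"
    tf_a = sum(doc_a.count(t) for t in query)
    tf_b = sum(doc_b.count(t) for t in query)
    high, low = (doc_a, doc_b) if tf_a >= tf_b else (doc_b, doc_a)
    return scorer(query, high) >= scorer(query, low)

# Toy diagnostic instance (illustrative only).
query = ["retrieval", "heuristics"]
doc_a = ["retrieval", "heuristics", "retrieval", "model", "test", "case"]
doc_b = ["retrieval", "axioms", "are", "useful", "for", "analysis"]
collection = doc_a + doc_b
scorer = lambda q, d: ql_score(q, d, collection)
print("TFC1 satisfied:", satisfies_tfc1(scorer, query, doc_a, doc_b))
```

In the paper's setting the scorer would be the vanilla BERT ranker and the document pairs would come from the diagnostic datasets built over the web corpus; a heuristic counts as satisfied to the extent the model's score ordering agrees with it across many such pairs.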
Related papers
- Utilizing BERT for Information Retrieval: Survey, Applications,
Resources, and Challenges [4.588192657854766]
This survey focuses on approaches that apply pretrained transformer encoders like BERT to information retrieval (IR).
We group them into six high-level categories: (i) handling long documents, (ii) integrating semantic information, (iii) balancing effectiveness and efficiency, (iv) predicting the weights of terms, (v) query expansion, and (vi) document expansion.
We find that, for specific tasks, fine-tuned BERT encoders still outperform, and at a lower deployment cost.
arXiv Detail & Related papers (2024-02-18T23:22:40Z) - All Data on the Table: Novel Dataset and Benchmark for Cross-Modality
Scientific Information Extraction [39.05577374775964]
We propose a semi-supervised pipeline for annotating entities in text, as well as entities and relations in tables, in an iterative procedure.
We release novel resources for the scientific community, including a high-quality benchmark, a large-scale corpus, and a semi-supervised annotation pipeline.
arXiv Detail & Related papers (2023-11-14T14:22:47Z) - Augmented Embeddings for Custom Retrievals [13.773007276544913]
We introduce Adapted Dense Retrieval, a mechanism to transform embeddings to enable improved task-specific, heterogeneous and strict retrieval.
Adapted Dense Retrieval works by learning a low-rank residual adaptation of the pretrained black-box embedding.
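As a rough illustration of what a low-rank residual adaptation of a frozen, black-box embedding could look like (a sketch under assumed dimensions and module names, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankResidualAdapter(nn.Module):
    """Adds a learned low-rank residual to a frozen, black-box embedding."""
    def __init__(self, dim=768, rank=16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # project down to low rank
        self.up = nn.Linear(rank, dim, bias=False)     # project back up
        nn.init.zeros_(self.up.weight)                 # adapter starts as identity

    def forward(self, base_embedding):
        # base_embedding comes from the pretrained model and is never updated.
        return base_embedding + self.up(self.down(base_embedding))

# Illustrative usage: adapt query/passage vectors and score by cosine similarity.
adapter = LowRankResidualAdapter()
query_vec = torch.randn(1, 768)    # stand-in for a black-box query embedding
passage_vec = torch.randn(1, 768)  # stand-in for a black-box passage embedding
score = F.cosine_similarity(adapter(query_vec), adapter(passage_vec))
print(score.item())
```

Initializing the up-projection to zero makes the adapter behave as the identity at the start of training, so only a task-specific residual has to be learned on top of the pretrained embedding.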
arXiv Detail & Related papers (2023-10-09T03:29:35Z) - Building Interpretable and Reliable Open Information Retriever for New
Domains Overnight [67.03842581848299]
Information retrieval is a critical component for many down-stream tasks such as open-domain question answering (QA).
We propose an information retrieval pipeline that uses an entity/event linking model and a query decomposition model to focus more accurately on the different information units of the query.
We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverage and denotation accuracy across five IR and QA benchmarks.
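A toy sketch of the decompose-then-retrieve idea described above; the rule-based splitting and term-overlap scoring below merely stand in for the paper's learned query decomposition and entity/event linking models.

```python
import re

def decompose_query(query):
    """Naive query decomposition: split on conjunctions and commas so each
    information unit can be retrieved separately (a toy stand-in for a
    learned decomposition model)."""
    parts = re.split(r"\band\b|,", query)
    return [p.strip() for p in parts if p.strip()]

def retrieve_per_unit(units, passages):
    """Retrieve one passage per information unit by simple term overlap
    (a stand-in for entity/event linking plus a real retriever)."""
    results = {}
    for unit in units:
        terms = set(unit.lower().split())
        scored = sorted(passages.items(),
                        key=lambda kv: -len(terms & set(kv[1].lower().split())))
        results[unit] = scored[0][0] if scored else None
    return results

passages = {
    "p1": "BERT was released by Google in 2018",
    "p2": "Robust04 is a classic ad-hoc retrieval test collection",
}
units = decompose_query("who released BERT and what is Robust04")
print(retrieve_per_unit(units, passages))
```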
arXiv Detail & Related papers (2023-08-09T07:47:17Z) - Enriching Relation Extraction with OpenIE [70.52564277675056]
Relation extraction (RE) is a sub-discipline of information extraction (IE).
In this work, we explore how recent approaches for open information extraction (OpenIE) may help to improve the task of RE.
Our experiments over two annotated corpora, KnowledgeNet and FewRel, demonstrate the improved accuracy of our enriched models.
arXiv Detail & Related papers (2022-12-19T11:26:23Z) - CorpusBrain: Pre-train a Generative Retrieval Model for
Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z) - A Large Scale Search Dataset for Unbiased Learning to Rank [51.97967284268577]
We introduce the Baidu-ULTR dataset for unbiased learning to rank.
It contains 1.2 billion randomly sampled search sessions and 7,008 expert-annotated queries.
It provides: (1) the original semantic features and a pre-trained language model for easy use; (2) sufficient display information such as position, displayed height, and displayed abstract; and (3) rich user feedback on search result pages (SERPs), such as dwell time.
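The dataset targets unbiased learning to rank, where click feedback must be corrected for position bias before it can supervise a ranker. One common correction, sketched here with assumed propensity values (not part of the Baidu-ULTR release), is inverse propensity weighting:

```python
import math

# Illustrative examination propensities per rank position (assumed values);
# clicks at low positions are up-weighted because they are rarely examined.
PROPENSITY = {1: 1.0, 2: 0.7, 3: 0.5, 4: 0.35, 5: 0.25}

def ipw_pointwise_loss(session):
    """Inverse-propensity-weighted click loss for one SERP.
    `session` is a list of (position, clicked, predicted_relevance) tuples."""
    loss = 0.0
    for position, clicked, pred in session:
        if clicked:
            weight = 1.0 / PROPENSITY.get(position, 0.1)
            # Cross-entropy on clicked items, de-biased by examination propensity.
            loss += -weight * math.log(max(pred, 1e-6))
    return loss

# One session: (position, clicked, model's predicted relevance probability).
session = [(1, 1, 0.9), (2, 0, 0.4), (3, 1, 0.2)]
print(ipw_pointwise_loss(session))
```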
arXiv Detail & Related papers (2022-07-07T02:37:25Z) - Autoregressive Search Engines: Generating Substrings as Document
Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
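A much simplified sketch of that identifier idea: index every word n-gram of each passage and score passages by the n-grams a generator emits. In the actual work the n-grams are decoded autoregressively under corpus constraints (via an FM-index); the toy corpus and the overlap scoring below are illustrative assumptions.

```python
from collections import defaultdict

def build_ngram_index(passages, n=3):
    """Map every word n-gram to the ids of passages that contain it."""
    index = defaultdict(set)
    for pid, text in passages.items():
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            index[tuple(tokens[i:i + n])].add(pid)
    return index

def retrieve(generated_ngrams, index):
    """Score passages by how many of the generated n-grams they contain."""
    scores = defaultdict(int)
    for ngram in generated_ngrams:
        for pid in index.get(tuple(ngram), ()):
            scores[pid] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy corpus and "generated" identifiers (in the paper these would be decoded
# by an autoregressive LM constrained to n-grams that occur in the corpus).
passages = {
    "p1": "bert outperforms the query likelihood model on web corpora",
    "p2": "axiomatic analysis builds diagnostic datasets for retrieval heuristics",
}
index = build_ngram_index(passages, n=3)
print(retrieve([["query", "likelihood", "model"]], index))
```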
arXiv Detail & Related papers (2022-04-22T10:45:01Z) - Hierarchical Neural Network Approaches for Long Document Classification [3.6700088931938835]
We employ pre-trained Universal Sentence Encoder (USE) and Bidirectional Encoder Representations from Transformers (BERT) models in a hierarchical setup to capture better representations efficiently.
Our proposed models are conceptually simple: we divide the input data into chunks and pass them through the base BERT and USE models.
We show that USE + CNN/LSTM performs better than its stand-alone baseline, whereas BERT + CNN/LSTM performs on par with its stand-alone counterpart.
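A minimal sketch of the hierarchical setup described above: split the document into chunks, encode each chunk, and run an LSTM head over the chunk embeddings. Random vectors stand in for the BERT/USE chunk encoder so the example stays self-contained; chunk size, dimensions, and the classifier head are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Run an LSTM over per-chunk embeddings, then classify the document."""
    def __init__(self, chunk_dim=768, hidden=128, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(chunk_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, chunk_embeddings):  # (batch, num_chunks, chunk_dim)
        _, (h_n, _) = self.lstm(chunk_embeddings)
        return self.head(h_n[-1])

def chunk_and_encode(tokens, chunk_size=128, dim=768):
    """Split a long document into chunks and encode each one.
    Random vectors stand in for BERT/USE chunk embeddings here."""
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return torch.randn(1, len(chunks), dim)  # (batch=1, num_chunks, dim)

doc = ["token"] * 500                 # a long document
model = HierarchicalClassifier()
logits = model(chunk_and_encode(doc))
print(logits.shape)                   # torch.Size([1, 2])
```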
arXiv Detail & Related papers (2022-01-18T07:17:40Z) - Maps Search Misspelling Detection Leveraging Domain-Augmented Contextual
Representations [4.619541348328937]
Building an independent misspelling detector and serving it before correction can bring multiple benefits to the speller and other search components.
With the rapid development of deep learning and substantial advances in contextual representation learning such as BERTology, building a decent misspelling detector without relying on the hand-crafted features associated with noisy-channel architectures is more accessible than ever.
In this paper we design four stages of models for misspelling detection, ranging from the most basic LSTM to a single-domain augmented fine-tuned BERT.
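A minimal sketch of the most basic stage mentioned above, a character-level LSTM that classifies a query token as misspelled or not; the vocabulary, dimensions, and label scheme are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CharLSTMMisspellingDetector(nn.Module):
    """Character-level LSTM that predicts misspelled vs. correctly spelled."""
    def __init__(self, vocab_size=128, embed_dim=32, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)   # 0 = correct, 1 = misspelled

    def forward(self, char_ids):           # (batch, max_len)
        _, (h_n, _) = self.lstm(self.embed(char_ids))
        return self.head(h_n[-1])

def encode(word, max_len=20):
    """Map characters to ids (ASCII codes, clipped to the embedding table)."""
    ids = [min(ord(c), 127) for c in word[:max_len]]
    ids += [0] * (max_len - len(ids))
    return torch.tensor([ids])

model = CharLSTMMisspellingDetector()
print(model(encode("restaurnat")).softmax(-1))  # untrained; scores are random
```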
arXiv Detail & Related papers (2021-08-15T23:59:12Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.