Wikiformer: Pre-training with Structured Information of Wikipedia for Ad-hoc Retrieval
- URL: http://arxiv.org/abs/2312.10661v2
- Date: Mon, 1 Jan 2024 06:42:06 GMT
- Title: Wikiformer: Pre-training with Structured Information of Wikipedia for Ad-hoc Retrieval
- Authors: Weihang Su, Qingyao Ai, Xiangsheng Li, Jia Chen, Yiqun Liu, Xiaolong Wu, Shengluan Hou
- Abstract summary: In this paper, we devise four pre-training objectives tailored for information retrieval tasks based on the structured knowledge of Wikipedia.
Compared to existing pre-training methods, our approach can better capture the semantic knowledge in the training corpus.
Experimental results in biomedical and legal domains demonstrate that our approach achieves better performance in vertical domains.
- Score: 21.262531222066208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the development of deep learning and natural language processing
techniques, pre-trained language models have been widely used to solve
information retrieval (IR) problems. Benefiting from the pre-training and
fine-tuning paradigm, these models achieve state-of-the-art performance. In
previous works, plain texts in Wikipedia have been widely used in the
pre-training stage. However, the rich structured information in Wikipedia, such
as the titles, abstracts, hierarchical heading (multi-level title) structure,
relationship between articles, references, hyperlink structures, and the
overall writing organization, has not been fully explored. In this paper, we devise
four pre-training objectives tailored for IR tasks based on the structured
knowledge of Wikipedia. Compared to existing pre-training methods, our approach
can better capture the semantic knowledge in the training corpus by leveraging
the human-edited structured data from Wikipedia. Experimental results on
multiple IR benchmark datasets show the superior performance of our model in
both zero-shot and fine-tuning settings compared to existing strong retrieval
baselines. In addition, experimental results in biomedical and legal domains
demonstrate that our approach achieves better performance in vertical domains
compared to previous models, especially in scenarios where long text similarity
matching is needed.
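The abstract does not detail the four objectives themselves, so the sketch below is only an illustration of the general idea under stated assumptions: turning human-edited Wikipedia structure (title, abstract, multi-level headings) into weakly supervised query-document pairs for contrastive pre-training. The data layout and every function name are hypothetical, not the paper's actual design.

```python
# Illustrative sketch only: one generic way Wikipedia's human-edited structure
# (title, abstract, multi-level headings) can be turned into weakly supervised
# query-document pairs for contrastive pre-training. The data layout and all
# names are assumptions, not the paper's objectives.
import random

def pseudo_pairs(article: dict):
    """Yield (pseudo_query, positive_passage) pairs from one parsed article.

    Expected layout (hypothetical):
      {"title": str, "abstract": str,
       "sections": [{"headings": ["H1", "H2", ...], "text": str}, ...]}
    """
    title = article["title"]
    # Title -> abstract: the article title acts as a query, the abstract as a relevant document.
    yield title, article["abstract"]
    # Heading path -> section text: "Title > H1 > H2" acts as a query for the section body.
    for section in article["sections"]:
        heading_path = " > ".join([title] + section["headings"])
        yield heading_path, section["text"]

def contrastive_triples(articles: list, batch_size: int = 8):
    """Sample (query, positive, negative) triples; negatives come from other pairs in the batch."""
    pairs = [pair for article in articles for pair in pseudo_pairs(article)]
    batch = random.sample(pairs, min(batch_size, len(pairs)))
    return [
        (query, positive, batch[(i + 1) % len(batch)][1])  # another pair's passage as a negative
        for i, (query, positive) in enumerate(batch)
    ]
```

The paper's objectives additionally exploit inter-article relations such as references and hyperlinks, which this toy pair generator ignores.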
Related papers
- End-to-End Ontology Learning with Large Language Models [11.755755139228219]
Large language models (LLMs) have been applied to solve various subtasks of ontology learning.
We address this gap with OLLM, a general and scalable method for building the taxonomic backbone of an ontology from scratch.
In contrast to standard metrics, our metrics use deep learning techniques to define more robust structural distance measures between graphs.
Our model can be effectively adapted to new domains, like arXiv, needing only a small number of training examples.
arXiv Detail & Related papers (2024-10-31T02:52:39Z)
- ATLANTIC: Structure-Aware Retrieval-Augmented Language Model for Interdisciplinary Science [0.0]
Large language models record impressive performance on many natural language processing tasks.
Retrieval augmentation offers an effective solution by retrieving context from external knowledge sources.
We propose a novel structure-aware retrieval augmented language model that accommodates document structure during retrieval augmentation.
arXiv Detail & Related papers (2023-11-21T02:02:46Z)
- Unifying Structure and Language Semantic for Efficient Contrastive Knowledge Graph Completion with Structured Entity Anchors [0.3913403111891026]
The goal of knowledge graph completion (KGC) is to predict missing links in a KG using facts that are already known.
We propose a novel method to effectively unify structure information and language semantics without losing the power of inductive reasoning.
arXiv Detail & Related papers (2023-11-07T11:17:55Z)
- Pre-Training to Learn in Context [138.0745138788142]
The ability of in-context learning is not fully exploited because language models are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z)
- KEPLET: Knowledge-Enhanced Pretrained Language Model with Topic Entity Awareness [12.90996504014071]
We propose KEPLET, a Knowledge-Enhanced Pre-trained LanguagE model with Topic entity awareness.
In an end-to-end manner, KEPLET identifies where to add the topic entity's information in a Wikipedia sentence.
Experiments demonstrate the generality and superiority of KEPLET, which was applied to two representative KEPLMs.
arXiv Detail & Related papers (2023-05-02T22:28:26Z)
- Domain-Specific Word Embeddings with Structure Prediction [3.057136788672694]
We present an empirical evaluation on New York Times articles and two English Wikipedia datasets with articles on science and philosophy.
Our method, called Word2Vec with Structure Prediction (W2VPred), provides better performance than baselines in terms of the general analogy tests.
As a use case in the field of Digital Humanities we demonstrate how to raise novel research questions for high literature from the German Text Archive.
arXiv Detail & Related papers (2022-10-06T12:45:48Z)
- Joint Language Semantic and Structure Embedding for Knowledge Graph Completion [66.15933600765835]
We propose to jointly embed the semantics in the natural language description of the knowledge triplets with their structure information.
Our method embeds knowledge graphs for the completion task via fine-tuning pre-trained language models.
Our experiments on a variety of knowledge graph benchmarks have demonstrated the state-of-the-art performance of our method.
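As a rough illustration of what embedding knowledge triplets via a fine-tuned pre-trained language model can look like, the hedged sketch below verbalizes a (head, relation, tail) triplet and scores it with a sequence-classification head. The checkpoint and relation template are illustrative assumptions, and the paper's structure-embedding component is omitted.

```python
# Hedged sketch of scoring a verbalized knowledge triplet with a pre-trained language
# model. The checkpoint, the relation template, and the classification head are
# illustrative assumptions; the structure-embedding side of the method is omitted,
# and the head gives meaningless scores until fine-tuned on KGC data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "bert-base-uncased"  # placeholder; any encoder PLM could be fine-tuned here
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

def verbalize(head: str, relation: str, tail: str) -> str:
    """Turn (head, relation, tail) into a natural-language description of the link."""
    return f"{head} {relation.replace('_', ' ')} {tail}."

def score_triplet(head: str, relation: str, tail: str) -> float:
    """Plausibility of the triplet; after fine-tuning, higher means a more likely link."""
    inputs = tokenizer(verbalize(head, relation, tail), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(score_triplet("Marie Curie", "educated_at", "University of Paris"))
```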
arXiv Detail & Related papers (2022-09-19T02:41:02Z)
- CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks [62.22920673080208]
A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
arXiv Detail & Related papers (2022-08-16T10:22:49Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de-facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
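A minimal sketch of that idea follows, under stated assumptions: the actual system pairs an autoregressive language model with an FM-index for constrained decoding, whereas here a plain dictionary maps n-grams back to passages and the model's n-gram scores are simply taken as given.

```python
# Minimal sketch of "all n-grams in a passage as its possible identifiers".
# The real system uses an autoregressive LM with an FM-index for constrained
# decoding; here a plain dictionary maps n-grams back to passages and a list of
# pre-scored n-grams stands in for the model. Everything below is illustrative.
from collections import defaultdict
from typing import Dict, List, Set, Tuple

def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_identifier_index(passages: Dict[str, str], n: int = 3) -> Dict[Tuple[str, ...], Set[str]]:
    """Map every n-gram (identifier) to the passages that contain it."""
    index: Dict[Tuple[str, ...], Set[str]] = defaultdict(set)
    for pid, text in passages.items():
        for gram in ngrams(text, n):
            index[gram].add(pid)
    return index

def retrieve(scored_ngrams: List[Tuple[Tuple[str, ...], float]],
             index: Dict[Tuple[str, ...], Set[str]]) -> List[Tuple[str, float]]:
    """Aggregate (model-assigned) n-gram scores into passage scores."""
    scores: Dict[str, float] = defaultdict(float)
    for gram, score in scored_ngrams:
        for pid in index.get(gram, ()):
            scores[pid] += score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```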
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms the pre-training of plain text using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
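For the last entry above, a hedged sketch of what a vocabulary-size-independent output layer can look like: each word's output embedding is composed from character embeddings on the fly, so logits can be produced even for words never seen during training. The mean-pooling architecture and all names below are simplifications, not the paper's actual model.

```python
# Sketch of a compositional output layer: word "output embeddings" are built from
# character embeddings on the fly, so the layer's parameter count does not depend
# on the training vocabulary. Mean-pooled character embeddings plus a projection
# is a deliberate simplification for illustration.
from typing import List

import torch
import torch.nn as nn

class CompositionalOutput(nn.Module):
    def __init__(self, hidden_dim: int, char_dim: int = 64, num_chars: int = 128):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)  # ASCII code points as characters
        self.proj = nn.Linear(char_dim, hidden_dim)

    def word_embedding(self, word: str) -> torch.Tensor:
        """Compose one output vector for a word from its spelling."""
        ids = torch.tensor([min(ord(c), 127) for c in word])
        return self.proj(self.char_emb(ids).mean(dim=0))

    def forward(self, hidden: torch.Tensor, candidate_words: List[str]) -> torch.Tensor:
        """Return logits over an arbitrary candidate word list for each hidden state."""
        emb = torch.stack([self.word_embedding(w) for w in candidate_words])  # (V, H)
        return hidden @ emb.t()                                               # (batch, V)

# Usage: the candidate set can change at test time without retraining the layer.
layer = CompositionalOutput(hidden_dim=32)
logits = layer(torch.randn(2, 32), ["retrieval", "wikipedia", "zettabyte"])
print(logits.shape)  # torch.Size([2, 3])
```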
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.