WebBrain: Learning to Generate Factually Correct Articles for Queries by
  Grounding on Large Web Corpus
- URL: http://arxiv.org/abs/2304.04358v1
- Date: Mon, 10 Apr 2023 02:55:48 GMT
- Title: WebBrain: Learning to Generate Factually Correct Articles for Queries by
  Grounding on Large Web Corpus
- Authors: Hongjing Qian, Yutao Zhu, Zhicheng Dou, Haoqi Gu, Xinyu Zhang, Zheng
  Liu, Ruofei Lai, Zhao Cao, Jian-Yun Nie and Ji-Rong Wen
- Abstract summary: We introduce a new NLP task -- generating short factual articles with references for queries by mining supporting evidence from the Web.
The ultimate goal is to generate a fluent, informative, and factually-correct short article for a factual query unseen in Wikipedia.
We construct a large-scale dataset WebBrain-Raw by extracting English Wikipedia articles and their crawlable Wikipedia references.
- Score: 61.209202634703104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   In this paper, we introduce a new NLP task -- generating short factual
articles with references for queries by mining supporting evidence from the
Web. In this task, called WebBrain, the ultimate goal is to generate a fluent,
informative, and factually-correct short article (e.g., a Wikipedia article)
for a factual query unseen in Wikipedia. To enable experiments on WebBrain, we
construct a large-scale dataset WebBrain-Raw by extracting English Wikipedia
articles and their crawlable Wikipedia references. WebBrain-Raw is ten times
larger than the previous biggest peer dataset, which can greatly benefit the
research community. From WebBrain-Raw, we construct two task-specific datasets:
WebBrain-R and WebBrain-G, which are used to train in-domain retriever and
generator, respectively. Besides, we empirically analyze the performances of
the current state-of-the-art NLP techniques on WebBrain and introduce a new
framework ReGen, which enhances the generation factualness by improved evidence
retrieval and task-specific pre-training for generation. Experiment results
show that ReGen outperforms all baselines in both automatic and human
evaluations.
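
The retrieve-then-generate idea described in the abstract can be illustrated with a toy sketch. This is not ReGen and not the authors' code: the word-overlap retriever, the template "generator", the `corpus` entries, and the `example.org` URLs are all assumptions made for illustration. It only shows the shape of the task, which is to select supporting evidence for a query and then emit text in which each statement carries a reference marker.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, passage):
    """Count query tokens that also occur in the passage (a crude relevance proxy)."""
    q = Counter(tokenize(query))
    p = Counter(tokenize(passage))
    return sum(min(q[t], p[t]) for t in q)


def retrieve(query, corpus, k=2):
    """Return up to k passages with non-zero overlap, best first."""
    ranked = sorted(corpus, key=lambda d: overlap_score(query, d["text"]), reverse=True)
    return [d for d in ranked[:k] if overlap_score(query, d["text"]) > 0]


def generate_article(query, evidence):
    """Stand-in for a neural generator: restate each passage with a citation marker."""
    body = [f"{d['text']} [{i}]" for i, d in enumerate(evidence, start=1)]
    refs = [f"[{i}] {d['url']}" for i, d in enumerate(evidence, start=1)]
    return "\n".join([f"Query: {query}", *body, "References:", *refs])


# Hypothetical mini-corpus; the URLs are placeholders, not real sources.
corpus = [
    {"url": "https://example.org/a",
     "text": "WebBrain-Raw pairs English Wikipedia articles with their crawlable references."},
    {"url": "https://example.org/b",
     "text": "ReGen combines evidence retrieval with task-specific pre-training."},
    {"url": "https://example.org/c",
     "text": "Bananas are rich in potassium."},
]

query = "What is the WebBrain-Raw dataset?"
evidence = retrieve(query, corpus)
article = generate_article(query, evidence)
print(article)
```

In a realistic system the overlap scorer would be replaced by a trained retriever and the template by a trained generator; the sketch keeps only the data flow from query to evidence to referenced text.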
 
      
Related papers
        - Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction [83.0216122783429]
 Web Reconstruction (WebR) is a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents.
We show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks.
 arXiv (2025-04-22T04:07:13Z)
- CorpusBrain++: A Continual Generative Pre-Training Framework for
  Knowledge-Intensive Language Tasks [111.13988772503511]
 Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers.
Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance.
 arXiv (2024-02-26T17:35:44Z)
- Cleaner Pretraining Corpus Curation with Neural Web Scraping [39.97459187762505]
 This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary and clean text contents from webpages.
 Experimental results show that NeuScraper surpasses the baseline scrapers by achieving more than a 20% improvement.
 arXiv (2024-02-22T16:04:03Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced
  Text-Attributed Graph Representation Learning [51.90524745663737]
 A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
 arXiv (2023-05-31T03:18:03Z)
- The Web Can Be Your Oyster for Improving Large Language Models [98.72358969495835]
 Large language models (LLMs) encode a large amount of world knowledge.
We consider augmenting LLMs with the large-scale web using a search engine.
We present a web-augmented LLM UNIWEB, which is trained over 16 knowledge-intensive tasks in a unified text-to-text format.
 arXiv (2023-05-18T14:20:32Z)
- PLM-GNN: A Webpage Classification Method based on Joint Pre-trained
  Language Model and Graph Neural Network [19.75890828376791]
 We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN.
It is based on the joint encoding of text and HTML DOM trees in web pages. It performs well on the KI-04 and SWDE datasets and on AHS, a practical dataset from a scholar homepage crawling project.
 arXiv (2023-05-09T12:19:10Z)
- Generate rather than Retrieve: Large Language Models are Strong Context
  Generators [74.87021992611672]
 We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer.
 arXiv (2022-09-21T01:30:59Z)
- CorpusBrain: Pre-train a Generative Retrieval Model for
  Knowledge-Intensive Language Tasks [62.22920673080208]
 A single-step generative model can dramatically simplify the search process and be optimized in an end-to-end manner.
We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index.
 arXiv (2022-08-16T10:22:49Z)
- A Transformer-based Neural Language Model that Synthesizes Brain
  Activation Maps from Free-Form Text Queries [37.322245313730654]
 Text2Brain is an easy to use tool for synthesizing brain activation maps from open-ended text queries.
Text2Brain was built on a transformer-based neural network language model and a coordinate-based meta-analysis of neuroimaging studies.
 arXiv (2022-07-24T09:15:03Z)
- BrainGB: A Benchmark for Brain Network Analysis with Graph Neural
  Networks [20.07976837999997]
 We present BrainGB, a benchmark for brain network analysis with Graph Neural Networks (GNNs).
BrainGB standardizes brain network construction pipelines for both functional and structural neuroimaging modalities.
We recommend a set of general recipes for effective GNN designs on brain networks.
 arXiv (2022-03-17T08:31:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     