WebIE: Faithful and Robust Information Extraction on the Web
- URL: http://arxiv.org/abs/2305.14293v2
- Date: Thu, 15 Jun 2023 13:51:36 GMT
- Title: WebIE: Faithful and Robust Information Extraction on the Web
- Authors: Chenxi Whitehouse, Clara Vania, Alham Fikri Aji, Christos
Christodoulopoulos, Andrea Pierleoni
- Abstract summary: We present WebIE, the first large-scale, entity-linked closed IE dataset consisting of 1.6M sentences.
WebIE includes negative examples, i.e., sentences without fact triples, to better reflect the data on the web.
We evaluate the in-domain, out-of-domain, and zero-shot cross-lingual performance of generative IE models and find that models trained on WebIE show better generalisability.
- Score: 7.361265860494963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Extracting structured and grounded fact triples from raw text is a
fundamental task in Information Extraction (IE). Existing IE datasets are
typically collected from Wikipedia articles, using hyperlinks to link entities
to the Wikidata knowledge base. However, models trained only on Wikipedia have
limitations when applied to web domains, which often contain noisy text or text
that does not have any factual information. We present WebIE, the first
large-scale, entity-linked closed IE dataset consisting of 1.6M sentences
automatically collected from the English Common Crawl corpus. WebIE also
includes negative examples, i.e., sentences without fact triples, to better
reflect the data on the web. We annotate ~21K triples from WebIE through
crowdsourcing and introduce mWebIE, a translation of the annotated set into four
other languages: French, Spanish, Portuguese, and Hindi. We evaluate the
in-domain, out-of-domain, and zero-shot cross-lingual performance of generative
IE models and find that models trained on WebIE show better generalisability. We
also propose three training strategies that use entity linking as an auxiliary
task. Our experiments show that adding entity-linking objectives improves the
faithfulness of our generative IE models.
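To make the entity-linking auxiliary objective concrete, here is a minimal multi-task training sketch using a HuggingFace BART model: the same encoder-decoder is trained to generate linearised triples (main task) and Wikidata entity links (auxiliary task), with the two losses mixed by an assumed weight. The linearisation markers, the link format, and the mixing weight are illustrative assumptions, not the paper's exact setup.

```python
# Minimal multi-task sketch: generative IE with an entity-linking (EL)
# auxiliary objective. Illustrative only; the target linearisation and
# the loss weight are assumptions, not the paper's exact recipe.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

sentence = "Barack Obama was born in Honolulu."
# Main task: linearised subject-relation-object triples.
ie_target = "<sub> Barack Obama <rel> place of birth <obj> Honolulu"
# Auxiliary task: link surface mentions to Wikidata entity IDs.
el_target = "Barack Obama = Q76 ; Honolulu = Q18094"

el_weight = 0.5  # assumed mixing weight for the auxiliary loss

inputs = tokenizer(sentence, return_tensors="pt")
ie_labels = tokenizer(text_target=ie_target, return_tensors="pt").input_ids
el_labels = tokenizer(text_target=el_target, return_tensors="pt").input_ids

ie_loss = model(**inputs, labels=ie_labels).loss   # triple generation
el_loss = model(**inputs, labels=el_labels).loss   # entity linking
loss = ie_loss + el_weight * el_loss
loss.backward()
optimizer.step()
```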
Related papers
- Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia [14.221520251569173]
We develop a framework for entity insertion called LocEI and its multilingual variant XLocEI.
We show that XLocEI outperforms all baseline models and can be applied in a zero-shot manner on languages not seen during training with minimal performance drop.
These findings are important for applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.
arXiv Detail & Related papers (2024-10-05T18:22:15Z)
- ADELIE: Aligning Large Language Models on Information Extraction [55.60192044049083]
Large language models (LLMs) usually fall short on information extraction tasks.
In this paper, we introduce ADELIE, an aligned LLM that effectively solves various IE tasks.
We show that our models achieve state-of-the-art (SoTA) performance among open-source models.
arXiv Detail & Related papers (2024-05-08T12:24:52Z)
- Mind2Web: Towards a Generalist Agent for the Web [25.363429937913065]
Mind2Web is the first dataset for developing and evaluating generalist agents for the web.
With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains, Mind2Web provides three necessary ingredients for building generalist web agents.
Based on Mind2Web, we conduct an initial exploration of using large language models (LLMs) for building generalist web agents.
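To make the dataset's structure concrete, here is a hypothetical record in the spirit of Mind2Web's task/website/action-sequence design; the field names and values are assumptions, not the released schema.

```python
# Hypothetical Mind2Web-style record; field names are illustrative,
# not the dataset's actual schema.
task = {
    "task_description": "Book a one-way flight from Boston to Seattle",
    "website": "example-airline.com",
    "domain": "Travel",
    "actions": [  # ground-truth action sequence over DOM elements
        {"operation": "CLICK", "target": "button#one-way"},
        {"operation": "TYPE",  "target": "input#origin", "value": "Boston"},
        {"operation": "TYPE",  "target": "input#destination", "value": "Seattle"},
        {"operation": "CLICK", "target": "button#search-flights"},
    ],
}
```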
arXiv Detail & Related papers (2023-06-09T17:44:31Z)
- InstructIE: A Bilingual Instruction-based Information Extraction Dataset [44.65162892808696]
Large language models can perform well on general natural language tasks, but their effectiveness is still suboptimal for information extraction (IE).
Recent works indicate that the main reason is the lack of extensive instruction data for IE.
We introduce InstructIE, a bilingual instruction-based IE dataset, which covers 12 diverse domains.
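As an illustration of instruction-based IE, a hypothetical example in this style might look as follows; the schema is an assumption, not InstructIE's published format.

```python
# Hypothetical instruction-based IE example; the exact schema used by
# InstructIE may differ.
example = {
    "instruction": "Extract all (head, relation, tail) triples about "
                   "organisations from the input text.",
    "input": "Amazon was founded by Jeff Bezos in Bellevue in 1994.",
    "output": [
        ("Amazon", "founded by", "Jeff Bezos"),
        ("Amazon", "founded in", "Bellevue"),
        ("Amazon", "inception", "1994"),
    ],
}
```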
arXiv Detail & Related papers (2023-05-19T08:51:11Z)
- Easy-to-Hard Learning for Information Extraction [57.827955646831526]
Information extraction systems aim to automatically extract structured information from unstructured texts.
We propose a unified easy-to-hard learning framework consisting of three stages, i.e., the easy stage, the hard stage, and the main stage.
By breaking down the learning process into multiple stages, our framework facilitates the model to acquire general IE task knowledge and improve its generalization ability.
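A minimal sketch of such a staged schedule, assuming each stage is simply a dataset fed to a shared training step; what counts as "easy" versus "hard" below is an illustrative assumption.

```python
# Minimal easy-to-hard curriculum sketch. What counts as "easy"
# (e.g., single-triple sentences) vs. "hard" (multi-triple, long
# contexts) is an assumption for illustration.
def train_epoch(model, dataset):
    for batch in dataset:
        model.training_step(batch)  # placeholder optimisation step

def easy_to_hard_training(model, easy_data, hard_data, main_data,
                          epochs_per_stage=(2, 2, 4)):
    # Stage 1: easy skills, e.g. extracting a single relation.
    # Stage 2: harder skills, e.g. multiple relations per sentence.
    # Stage 3: main stage on the full task distribution.
    for stage, epochs in zip((easy_data, hard_data, main_data),
                             epochs_per_stage):
        for _ in range(epochs):
            train_epoch(model, stage)
```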
arXiv Detail & Related papers (2023-05-16T06:04:14Z)
- A Suite of Generative Tasks for Multi-Level Multimodal Webpage Understanding [66.6468787004067]
We introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data.
We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens to attend to the rest of the webpage for context.
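The abstract only names the mechanism, but the underlying idea of a few selected global tokens attending everywhere while the remaining tokens attend within a local window can be sketched as an attention mask; the window size and choice of global indices below are assumptions.

```python
# Sketch of a Prefix-Global-style attention mask: a small set of
# "global" tokens (e.g., the most relevant image/text pieces) attends
# to, and is attended by, every position; all other tokens only see a
# local window. Window size and global indices are assumptions.
import torch

def prefix_global_mask(seq_len, global_idx, window=4):
    allowed = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        allowed[i, lo:hi] = True          # local neighbourhood
    allowed[global_idx, :] = True         # globals attend everywhere
    allowed[:, global_idx] = True         # everyone attends to globals
    return allowed

mask = prefix_global_mask(seq_len=16, global_idx=[0, 1, 2])
print(mask.shape)  # torch.Size([16, 16])
```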
arXiv Detail & Related papers (2023-05-05T16:38:05Z)
- ChatIE: Zero-Shot Information Extraction via Chatting with ChatGPT [89.49161588240061]
Zero-shot information extraction (IE) aims to build IE systems from unannotated text.
Recent efforts on large language models (LLMs, e.g., GPT-3, ChatGPT) show promising performance in zero-shot settings.
We transform the zero-shot IE task into a multi-turn question-answering problem with a two-stage framework (ChatIE).
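A minimal sketch of the two-stage, multi-turn idea, with a stubbed chat function standing in for the ChatGPT API; the prompts and relation handling are illustrative, not the paper's exact templates.

```python
# Two-stage multi-turn QA sketch in the spirit of ChatIE. `chat` is a
# stub standing in for a ChatGPT-style API; prompts are illustrative.
def chat(history, user_msg):
    history.append({"role": "user", "content": user_msg})
    reply = "..."  # placeholder: call an LLM API here
    history.append({"role": "assistant", "content": reply})
    return reply

def chatie_extract(text, relation_types):
    history = [{"role": "system", "content": "You are an IE assistant."}]
    # Stage 1: ask which relation types occur in the text.
    present = chat(history, f"Which of {relation_types} appear in: {text}")
    # Stage 2: one follow-up turn per detected relation type.
    triples = []
    for rel in relation_types:
        if rel in present:
            args = chat(history, f"List all (head, tail) pairs for '{rel}'.")
            triples.append((rel, args))
    return triples
```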
arXiv Detail & Related papers (2023-02-20T12:57:12Z)
- Mapping Process for the Task: Wikidata Statements to Text as Wikipedia Sentences [68.8204255655161]
We propose a mapping process for the task of converting Wikidata statements to natural language text (WS2T) for Wikipedia projects at the sentence level.
The main step is to organize statements, represented as a group of quadruples and triples, and then to map them to corresponding sentences in English Wikipedia.
We evaluate the output corpus in various aspects: sentence structure analysis, noise filtering, and relationships between sentence components based on word embedding models.
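A toy sketch of the core mapping step, matching statement labels against sentence text; a real pipeline would also use entity linking and embedding-based similarity, and the example statement is invented.

```python
# Toy WS2T-style mapping: pair a Wikidata statement (as labels) with
# Wikipedia sentences that mention both subject and object. Real
# alignment would also use embeddings and qualifiers; this is a sketch.
def map_statement_to_sentences(statement, sentences):
    subj, _rel, obj = statement
    return [s for s in sentences
            if subj.lower() in s.lower() and obj.lower() in s.lower()]

statement = ("Marie Curie", "award received", "Nobel Prize in Physics")
sentences = [
    "Marie Curie was awarded the Nobel Prize in Physics in 1903.",
    "She was born in Warsaw.",
]
print(map_statement_to_sentences(statement, sentences))
# ['Marie Curie was awarded the Nobel Prize in Physics in 1903.']
```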
arXiv Detail & Related papers (2022-10-23T08:34:33Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)