Unified Text Structuralization with Instruction-tuned Language Models
- URL: http://arxiv.org/abs/2303.14956v2
- Date: Thu, 30 Mar 2023 13:41:54 GMT
- Title: Unified Text Structuralization with Instruction-tuned Language Models
- Authors: Xuanfan Ni, Piji Li and Huayang Li
- Abstract summary: We propose a simple and efficient approach to instruct a large language model (LLM) to extract a variety of structures from texts.
Experiments show that this approach enables language models to perform comparably to other state-of-the-art methods on datasets covering a variety of languages and knowledge domains.
- Score: 28.869098023025753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text structuralization is one of the important fields of natural language
processing (NLP); it consists of information extraction (IE) and structure
formalization. However, current studies of text structuralization suffer from a
shortage of manually annotated, high-quality datasets from different domains and
languages, which require specialized professional knowledge to build. In addition,
most IE methods are designed for a specific type of structured data, e.g., entities,
relations, or events, making them hard to generalize to others. In this work,
we propose a simple and efficient approach to instruct a large language model
(LLM) to extract a variety of structures from texts. More concretely, we add a
prefix and a suffix instruction to indicate the desired IE task and structure
type, respectively, before feeding the text into an LLM. Experiments on two LLMs
show that this approach enables language models to perform comparably to
other state-of-the-art methods on datasets covering a variety of languages and
knowledge, and that it can generalize to other IE sub-tasks by changing the content
of the instruction. Another benefit of our approach is that it can help researchers
build datasets in low-resource and domain-specific scenarios, e.g., finance and
law, at low cost.
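To make the prompt construction concrete, here is a minimal sketch of the prefix/suffix instruction scheme described in the abstract, assuming a generic text-completion LLM behind a placeholder call_llm function; the instruction wording is illustrative, not the paper's released prompts.

```python
# Minimal sketch of instruction-wrapped text structuralization (illustrative only).
# `call_llm` stands in for any text-completion LLM; the exact instruction
# wording below is an assumption, not the paper's prompt.
from typing import Callable

def build_prompt(text: str, task: str, structure: str) -> str:
    """Wrap the input text with a task prefix and a structure-type suffix."""
    prefix = f"Perform {task} on the following text."
    suffix = f"Return the result as {structure}."
    return f"{prefix}\n\nText: {text}\n\n{suffix}"

def structuralize(text: str, task: str, structure: str,
                  call_llm: Callable[[str], str]) -> str:
    """Feed the instruction-wrapped prompt to an LLM and return its raw output."""
    return call_llm(build_prompt(text, task, structure))

# Switching IE sub-tasks only changes the instruction content, e.g.:
# entities = structuralize(doc, "named entity recognition",
#                          "a JSON list of (entity, type) pairs", call_llm)
# triples  = structuralize(doc, "relation extraction",
#                          "a list of (head, relation, tail) triples", call_llm)
```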
Related papers
- Raw Text is All you Need: Knowledge-intensive Multi-turn Instruction Tuning for Large Language Model [25.459787361454353]
We present a novel framework named R2S that leverages the CoD-Chain of Dialogue logic to guide large language models (LLMs) in generating knowledge-intensive multi-turn dialogues for instruction tuning.
By integrating raw documents from both open-source datasets and domain-specific web-crawled documents into a benchmark, K-BENCH, we cover diverse areas such as Wikipedia (English), Science (Chinese), and Artifacts (Chinese).
arXiv Detail & Related papers (2024-07-03T12:04:10Z) - SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present the SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z) - A Simple but Effective Approach to Improve Structured Language Model
Output for Information Extraction [11.165093163378152]
Large language models (LLMs) have demonstrated impressive abilities in generating unstructured natural language according to instructions.
This paper introduces an efficient method, G&O, to enhance their structured text generation capabilities.
arXiv Detail & Related papers (2024-02-20T20:42:02Z) - Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for
Language Models [153.14575887549088]
We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs).
GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines.
With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with a broad coverage across the entire spectrum of human knowledge and skills.
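As a rough illustration of that pipeline, the sketch below walks a pre-curated taxonomy (simplified here to discipline -> subject -> key concepts) and asks a placeholder call_llm generator to write one instruction-answer pair per concept; the data layout and prompt wording are assumptions, not GLAN's released implementation.

```python
# Illustrative sketch of taxonomy-driven synthetic instruction generation.
# The taxonomy layout and prompt text are assumptions; GLAN's actual
# syllabus/class-session pipeline is more elaborate.
from typing import Callable, Dict, List

def generate_instructions(taxonomy: Dict[str, Dict[str, List[str]]],
                          call_llm: Callable[[str], str]) -> List[dict]:
    """Walk discipline -> subject -> key concept and synthesize one sample per concept."""
    samples = []
    for discipline, subjects in taxonomy.items():
        for subject, key_concepts in subjects.items():
            for concept in key_concepts:
                prompt = (f"Write a challenging instruction and its answer about "
                          f"'{concept}' ({subject}, {discipline}).")
                samples.append({"discipline": discipline,
                                "subject": subject,
                                "concept": concept,
                                "sample": call_llm(prompt)})
    return samples
```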
arXiv Detail & Related papers (2024-02-20T15:00:35Z) - DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain
Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z) - OntoType: Ontology-Guided and Pre-Trained Language Model Assisted Fine-Grained Entity Typing [25.516304052884397]
Fine-grained entity typing (FET) assigns context-sensitive, fine-grained semantic types to entities in text.
OntoType follows a type ontological structure, from coarse to fine, and ensembles multiple PLM prompting results to generate a set of type candidates.
Our experiments on the Ontonotes, FIGER, and NYT datasets demonstrate that our method outperforms the state-of-the-art zero-shot fine-grained entity typing methods.
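A rough sketch of what such coarse-to-fine, ensemble-prompted candidate generation could look like is given below; the prompt_plm function, the two templates, and the ontology layout are illustrative assumptions, not OntoType's actual implementation.

```python
# Rough sketch of ontology-guided, coarse-to-fine type candidate generation.
# `prompt_plm`, the prompt templates, and the ontology layout are assumptions,
# not OntoType's released code.
from typing import Callable, Dict, List

def type_candidates(mention: str, context: str,
                    ontology: Dict[str, List[str]],
                    prompt_plm: Callable[[str], str],
                    coarse_types: List[str]) -> List[str]:
    """Refine coarse types into finer candidates by walking the ontology top-down."""
    templates = [f"In '{context}', {mention} is a kind of",
                 f"{mention} can best be described as a"]
    candidates, frontier = [], list(coarse_types)
    while frontier:
        parent = frontier.pop()
        children = ontology.get(parent, [])
        if not children:
            continue
        # Ensemble of prompts: keep any child type that some template selects.
        picked = {prompt_plm(f"{t} which of the following: {', '.join(children)}?")
                  for t in templates}
        kept = [c for c in children if c in picked]
        candidates.extend(kept)
        frontier.extend(kept)
    return candidates
```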
arXiv Detail & Related papers (2023-05-21T00:32:37Z) - Unified Structure Generation for Universal Information Extraction [58.89057387608414]
UIE can universally model different IE tasks, adaptively generate targeted structures, and collaboratively learn general IE abilities from different knowledge sources.
Experiments show that UIE achieves state-of-the-art performance on 4 IE tasks and 13 datasets across all supervised, low-resource, and few-shot settings.
arXiv Detail & Related papers (2022-03-23T08:49:29Z) - LiLT: A Simple yet Effective Language-Independent Layout Transformer for
Structured Document Understanding [33.78249073009646]
We propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding.
LiLT can be pre-trained on the structured documents of a single language and then directly fine-tuned on other languages.
Experimental results on eight languages have shown that LiLT can achieve competitive or even superior performance on diverse widely-used downstream benchmarks.
arXiv Detail & Related papers (2022-02-28T10:33:01Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - ERICA: Improving Entity and Relation Understanding for Pre-trained
Language Models via Contrastive Learning [97.10875695679499]
We propose a novel contrastive learning framework named ERICA for the pre-training phase to obtain a deeper understanding of entities and their relations in text.
Experimental results demonstrate that our proposed ERICA framework achieves consistent improvements on several document-level language understanding tasks.
arXiv Detail & Related papers (2020-12-30T03:35:22Z)