Retrieval-Augmented Code Generation for Universal Information Extraction
- URL: http://arxiv.org/abs/2311.02962v1
- Date: Mon, 6 Nov 2023 09:03:21 GMT
- Title: Retrieval-Augmented Code Generation for Universal Information Extraction
- Authors: Yucan Guo, Zixuan Li, Xiaolong Jin, Yantao Liu, Yutao Zeng, Wenxuan
Liu, Xiang Li, Pan Yang, Long Bai, Jiafeng Guo and Xueqi Cheng
- Abstract summary: Information Extraction aims to extract structural knowledge from natural language texts.
We propose a universal retrieval-augmented code generation framework based on Large Language Models (LLMs)
Code4UIE adopts Python classes to define task-specific schemas of various structural knowledge in a universal way.
- Score: 66.68673051922497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Information Extraction (IE) aims to extract structural knowledge (e.g.,
entities, relations, events) from natural language texts, which brings
challenges to existing methods due to task-specific schemas and complex text
expressions. Code, as a typical kind of formalized language, is capable of
describing structural knowledge under various schemas in a universal way. On
the other hand, Large Language Models (LLMs) trained on both codes and texts
have demonstrated powerful capabilities of transforming texts into codes, which
provides a feasible solution to IE tasks. Therefore, in this paper, we propose
a universal retrieval-augmented code generation framework based on LLMs, called
Code4UIE, for IE tasks. Specifically, Code4UIE adopts Python classes to define
task-specific schemas of various structural knowledge in a universal way. By so
doing, extracting knowledge under these schemas can be transformed into
generating codes that instantiate the predefined Python classes with the
information in texts. To generate these codes more precisely, Code4UIE adopts
the in-context learning mechanism to instruct LLMs with examples. In order to
obtain appropriate examples for different tasks, Code4UIE explores several
example retrieval strategies, which can retrieve examples semantically similar
to the given texts. Extensive experiments on five representative IE tasks
across nine datasets demonstrate the effectiveness of the Code4UIE framework.
Related papers
- CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [103.116634967815]
We introduce CodeXEmbed, a family of large-scale code embedding models ranging from 400M to 7B parameters.
Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework.
Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark.
arXiv Detail & Related papers (2024-11-19T16:54:45Z) - Building A Coding Assistant via the Retrieval-Augmented Language Model [24.654428111628242]
We propose a retrieval-augmeNted language model (CONAN) to build a code assistant by mimicking the knowledge-seeking behaviors of humans during coding.
It consists of a code structure aware retriever (CONAN-R) and a dual-view code representation-based retrieval-augmented generation model (CONAN-G)
arXiv Detail & Related papers (2024-10-21T17:34:39Z) - DocCGen: Document-based Controlled Code Generation [33.19206322891497]
DocCGen is a framework that can leverage rich knowledge by breaking the NL-to-Code generation task for structured code languages into a two-step process.
Our experiments show that DocCGen consistently improves different-sized language models across all six evaluation metrics.
arXiv Detail & Related papers (2024-06-17T08:34:57Z) - Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text [1.9185059111021852]
We investigate how pre-trained Language Models handle code-switched text in three dimensions.
Our findings reveal that pre-trained language models are effective in generalising to code-switched text.
arXiv Detail & Related papers (2024-03-07T19:46:03Z) - Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z) - CodeIE: Large Code Generation Models are Better Few-Shot Information
Extractors [92.17328076003628]
Large language models (LLMs) pre-trained on massive corpora have demonstrated impressive few-shot learning ability on many NLP tasks.
In this paper, we propose to recast the structured output in the form of code instead of natural language.
arXiv Detail & Related papers (2023-05-09T18:40:31Z) - Unified Text Structuralization with Instruction-tuned Language Models [28.869098023025753]
We propose a simple and efficient approach to instruct large language model (LLM) to extract a variety of structures from texts.
Experiments show that this approach can enable language models to perform comparable with other state-of-the-art methods on datasets of a variety of languages and knowledge.
arXiv Detail & Related papers (2023-03-27T07:39:05Z) - Code4Struct: Code Generation for Few-Shot Event Structure Prediction [55.14363536066588]
We propose Code4Struct to leverage text-to-structure translation capability to tackle structured prediction tasks.
We formulate Event Argument Extraction (EAE) as converting text into event-argument structures that can be represented as a class object using code.
Code4Struct is comparable to supervised models trained on 4,202 instances and outperforms current state-of-the-art (SOTA) trained on 20-shot data by 29.5% absolute F1.
arXiv Detail & Related papers (2022-10-23T18:18:51Z) - TegTok: Augmenting Text Generation via Task-specific and Open-world
Knowledge [83.55215993730326]
We propose augmenting TExt Generation via Task-specific and Open-world Knowledge (TegTok) in a unified framework.
Our model selects knowledge entries from two types of knowledge sources through dense retrieval and then injects them into the input encoding and output decoding stages respectively.
arXiv Detail & Related papers (2022-03-16T10:37:59Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets with the unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.