Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset
- URL: http://arxiv.org/abs/2412.20072v2
- Date: Tue, 31 Dec 2024 03:11:03 GMT
- Title: Extract Information from Hybrid Long Documents Leveraging LLMs: A Framework and Dataset
- Authors: Chongjian Yue, Xinrun Xu, Xiaojun Ma, Lun Du, Zhiming Ding, Shi Han, Dongmei Zhang, Qi Zhang,
- Abstract summary: Large Language Models (LLMs) can comprehend and analyze hybrid text, containing textual and tabular data.
We propose an Automated Information Extraction framework (AIE) to enable LLMs to process the hybrid long documents (HLDs) and carry out experiments to analyse four important aspects of information extraction from HLDs.
To address the issue of dataset scarcity in HLDs and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset.
- Score: 52.286323454512996
- License:
- Abstract: Large Language Models (LLMs) demonstrate exceptional performance in textual understanding and tabular reasoning tasks. However, their ability to comprehend and analyze hybrid text, containing textual and tabular data, remains unexplored. The hybrid text often appears in the form of hybrid long documents (HLDs), which far exceed the token limit of LLMs. Consequently, we apply an Automated Information Extraction framework (AIE) to enable LLMs to process the HLDs and carry out experiments to analyse four important aspects of information extraction from HLDs. Given the findings: 1) The effective way to select and summarize the useful part of a HLD. 2) An easy table serialization way is enough for LLMs to understand tables. 3) The naive AIE has adaptability in many complex scenarios. 4) The useful prompt engineering to enhance LLMs on HLDs. To address the issue of dataset scarcity in HLDs and support future work, we also propose the Financial Reports Numerical Extraction (FINE) dataset. The dataset and code are publicly available in the attachments.
Related papers
- Scaling Up Summarization: Leveraging Large Language Models for Long Text Extractive Summarization [0.27624021966289597]
This paper introduces EYEGLAXS, a framework that leverages Large Language Models (LLMs) for extractive summarization.
EYEGLAXS focuses on extractive summarization to ensure factual and grammatical integrity.
The system sets new performance benchmarks on well-known datasets like PubMed and ArXiv.
arXiv Detail & Related papers (2024-08-28T13:52:19Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimize an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z) - A Practice-Friendly LLM-Enhanced Paradigm with Preference Parsing for Sequential Recommendation [15.153844486572932]
This paper proposes a practice-friendly LLM-enhanced paradigm with preference parsing (P2Rec) for sequential recommender systems (SRS)
Specifically, in the information reconstruction stage, we design a new user-level SFT task for collaborative information injection with the assistance of a pre-trained SRS model.
Our goal is to let LLM learn to reconstruct a corresponding prior preference distribution from each user's interaction sequence.
arXiv Detail & Related papers (2024-06-01T07:18:56Z) - Unsupervised Information Refinement Training of Large Language Models for Retrieval-Augmented Generation [128.01050030936028]
We propose an information refinement training method named InFO-RAG.
InFO-RAG is low-cost and general across various tasks.
It improves the performance of LLaMA2 by an average of 9.39% relative points.
arXiv Detail & Related papers (2024-02-28T08:24:38Z) - Large Language Models for Data Annotation and Synthesis: A Survey [49.8318827245266]
This survey focuses on the utility of Large Language Models for data annotation and synthesis.
It includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation and synthesis.
arXiv Detail & Related papers (2024-02-21T00:44:04Z) - Large Language Model with Graph Convolution for Recommendation [21.145230388035277]
Text information can sometimes be of low quality, hindering its effectiveness for real-world applications.
With knowledge and reasoning capabilities capsuled in Large Language Models, utilizing LLMs emerges as a promising way for description improvement.
We propose a Graph-aware Convolutional LLM method to elicit LLMs to capture high-order relations in the user-item graph.
arXiv Detail & Related papers (2024-02-14T00:04:33Z) - INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning [59.07490387145391]
Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks.
Their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language.
We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories.
arXiv Detail & Related papers (2024-01-12T12:10:28Z) - How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model [12.358079352117699]
We explore Multimodal Large Language Models (MLLMs), which integrate LLMs to handle multimodal data, including text, images, audio, and more.
MLLMs face challenges in addressing the semantic gap in multimodal data, which may lead to erroneous outputs.
Implementing effective modality alignment can help LLMs address environmental issues and enhance accessibility.
arXiv Detail & Related papers (2023-11-10T09:51:24Z) - Investigating Table-to-Text Generation Capabilities of LLMs in
Real-World Information Seeking Scenarios [32.84523661055774]
Tabular data is prevalent across various industries, necessitating significant time and effort for users to understand and manipulate for their information-seeking purposes.
The adoption of large language models (LLMs) in real-world applications for table information seeking remains underexplored.
This paper investigates the table-to-text capabilities of different LLMs using four datasets within two real-world information seeking scenarios.
arXiv Detail & Related papers (2023-05-24T10:22:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.