FETILDA: An Effective Framework For Fin-tuned Embeddings For Long
Financial Text Documents
- URL: http://arxiv.org/abs/2206.06952v1
- Date: Tue, 14 Jun 2022 16:14:14 GMT
- Title: FETILDA: An Effective Framework For Fin-tuned Embeddings For Long
Financial Text Documents
- Authors: Bolun "Namir" Xia, Vipula D. Rawte, Mohammed J. Zaki, Aparna Gupta
- Abstract summary: We propose and implement a deep learning framework that splits long documents into chunks and utilizes pre-trained LMs to process and aggregate the chunks into vector representations.
We evaluate our framework on a collection of 10-K public disclosure reports from US banks, and another dataset of reports submitted by US companies.
- Score: 14.269860621624394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Unstructured data, especially text, continues to grow rapidly in various
domains. In particular, in the financial sphere, there is a wealth of
accumulated unstructured financial data, such as the textual disclosure
documents that companies submit on a regular basis to regulatory agencies, such
as the Securities and Exchange Commission (SEC). These documents are typically
very long and tend to contain valuable soft information about a company's
performance. It is therefore of great interest to learn predictive models from
these long textual documents, especially for forecasting numerical key
performance indicators (KPIs). Whereas there has been great progress in
pre-trained language models (LMs) that learn from tremendously large corpora of
textual data, they still struggle to produce effective representations for
long documents. Our work fills this critical need, namely how to develop better
models to extract useful information from long textual documents and learn
effective features that can leverage the soft financial and risk information
for text regression (prediction) tasks. In this paper, we propose and implement
a deep learning framework that splits long documents into chunks and utilizes
pre-trained LMs to process and aggregate the chunks into vector
representations, followed by self-attention to extract valuable document-level
features. We evaluate our model on a collection of 10-K public disclosure
reports from US banks, and another dataset of reports submitted by US
companies. Overall, our framework outperforms strong baseline methods for
textual modeling as well as a baseline regression model using only numerical
data. Our work provides insights into how pre-trained domain-specific and
fine-tuned long-input LMs can improve the quality of long-document
representations and, therefore, help improve predictive analyses.
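To make the proposed pipeline concrete, below is a minimal sketch, not the authors' released code: it assumes PyTorch and Hugging Face transformers, a generic bert-base-uncased encoder, mean pooling within each chunk, and self-attention over the chunk vectors; the class name ChunkAggregateRegressor, the 512-token chunk size, and the head dimensions are illustrative assumptions.
```python
# Minimal sketch of the chunk-and-aggregate architecture described in the
# abstract (NOT the authors' released code; names and sizes are assumed).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class ChunkAggregateRegressor(nn.Module):  # hypothetical name
    def __init__(self, lm_name: str = "bert-base-uncased", hidden: int = 768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(lm_name)  # pre-trained LM for chunks
        # Self-attention over the sequence of chunk vectors (document level).
        self.chunk_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, ids: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # ids, mask: (num_chunks, seq_len) for one document.
        states = self.encoder(input_ids=ids, attention_mask=mask).last_hidden_state
        # Mean-pool token states within each chunk -> one vector per chunk.
        summed = (states * mask.unsqueeze(-1)).sum(dim=1)
        chunk_vecs = (summed / mask.sum(dim=1, keepdim=True)).unsqueeze(0)
        attended, _ = self.chunk_attn(chunk_vecs, chunk_vecs, chunk_vecs)
        doc_vec = attended.mean(dim=1)          # document-level feature
        return self.head(doc_vec).squeeze(-1)   # predicted numerical KPI


# Usage: split a long filing into fixed-size chunks and predict a KPI.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
filing_text = "..."  # a long 10-K disclosure (placeholder)
tokens = tok(filing_text, return_tensors="pt")["input_ids"][0]
chunks = [tokens[i:i + 512] for i in range(0, len(tokens), 512)]
ids = nn.utils.rnn.pad_sequence(chunks, batch_first=True,
                                padding_value=tok.pad_token_id)
mask = (ids != tok.pad_token_id).long()
model = ChunkAggregateRegressor()
kpi_pred = model(ids, mask)
```
Swapping in a domain-specific or long-input encoder only changes lm_name; the chunk-level self-attention and regression head stay the same.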
Related papers
- Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data, and it plays an indispensable role in both knowledge base construction and training data generation.
This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z)
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
- LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models [73.13933847198395]
We propose a training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding.
The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output; a rough sketch of this pattern follows.
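A rough illustration of the divide-and-conquer pattern described above, not the paper's implementation: llm stands in for any text-completion call, and the prompts, chunk size, and the function name map_reduce_answer are hypothetical.
```python
# Map-reduce over document chunks (illustrative sketch; `llm` is a
# hypothetical completion function standing in for any LLM API).
from typing import Callable, List


def map_reduce_answer(document: str, question: str,
                      llm: Callable[[str], str],
                      chunk_chars: int = 8000) -> str:
    # Map: split the document and answer the question against each chunk.
    chunks: List[str] = [document[i:i + chunk_chars]
                         for i in range(0, len(document), chunk_chars)]
    partials = [
        llm(f"Context:\n{chunk}\n\nQuestion: {question}\n"
            "Answer from this context only; reply 'no information' if absent.")
        for chunk in chunks
    ]
    # Reduce: aggregate the intermediate answers into the final output.
    joined = "\n".join(f"- {p}" for p in partials)
    return llm(f"Question: {question}\n"
               f"Partial answers from document chunks:\n{joined}\n"
               "Synthesize a single final answer.")
```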
arXiv Detail & Related papers (2024-10-12T03:13:44Z)
- SEGMENT+: Long Text Processing with Short-Context Language Models [53.40059130780192]
SEGMENT+ is a framework that enables LMs to handle extended inputs within limited context windows efficiently.
SEGMENT+ utilizes structured notes and a filtering module to manage information flow, resulting in a system that is both controllable and interpretable.
arXiv Detail & Related papers (2024-10-09T03:40:22Z)
- Integrating Planning into Single-Turn Long-Form Text Generation [66.08871753377055]
We propose to use planning to generate long-form content.
Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning.
Our experiments on two datasets from different domains demonstrate that LLMs fine-tuned with the auxiliary task generate higher-quality documents.
arXiv Detail & Related papers (2024-10-08T17:02:40Z)
- Leveraging Long-Context Large Language Models for Multi-Document Understanding and Summarization in Enterprise Applications [1.1682259692399921]
Long-context Large Language Models (LLMs) can grasp extensive connections, provide cohesive summaries, and adapt to various industry domains.
Case studies show notable enhancements in both efficiency and accuracy.
arXiv Detail & Related papers (2024-09-27T05:29:31Z)
- LongWanjuan: Towards Systematic Measurement for Long Text Quality [102.46517202896521]
LongWanjuan is a dataset of over 160B tokens specifically tailored to enhance the training of language models for long-text tasks.
In LongWanjuan, we categorize long texts into holistic, aggregated, and chaotic types, enabling a detailed analysis of long-text quality.
We devise a data mixture recipe that strategically balances different types of long texts within LongWanjuan, leading to significant improvements in model performance on long-text tasks.
arXiv Detail & Related papers (2024-02-21T07:27:18Z)
- LongFin: A Multimodal Document Understanding Model for Long Financial Domain Documents [4.924255992661131]
We introduce LongFin, a multimodal document AI model capable of encoding up to 4K tokens.
We also propose the LongForms dataset that encapsulates several industrial challenges in financial documents.
arXiv Detail & Related papers (2024-01-26T18:23:45Z)
- Large Language Model Adaptation for Financial Sentiment Analysis [2.0499240875882]
Generalist language models tend to fall short in tasks specifically tailored for finance.
Two foundation models with less than 1.5B parameters have been adapted using a wide range of strategies.
We show that small LLMs have comparable performance to larger-scale models, while being more efficient in terms of parameters and data.
arXiv Detail & Related papers (2024-01-26T11:04:01Z)
- Multimodal Document Analytics for Banking Process Automation [4.541582055558865]
The paper contributes original empirical evidence on the effectiveness and efficiency of multimodal models for document processing in the banking business.
It offers practical guidance on how to unlock this potential in day-to-day operations.
arXiv Detail & Related papers (2023-07-21T18:29:04Z)
- Long Document Summarization with Top-down and Bottom-up Inference [113.29319668246407]
We propose a principled inference framework to improve summarization models in two respects.
Our framework assumes a hierarchical latent structure of a document where the top-level captures the long range dependency.
We demonstrate the effectiveness of the proposed framework on a diverse set of summarization datasets.
arXiv Detail & Related papers (2022-03-15T01:24:51Z)