Tracing Content Requirements in Financial Documents using Multi-granularity Text Analysis
- URL: http://arxiv.org/abs/2110.14960v2
- Date: Tue, 25 Mar 2025 14:42:17 GMT
- Title: Tracing Content Requirements in Financial Documents using Multi-granularity Text Analysis
- Authors: Xiaochen Li, Domenico Bianculli, Lionel C. Briand,
- Abstract summary: The completeness (in terms of content) of financial documents is a fundamental requirement for investment funds.<n>We propose FITI to trace content requirements in financial documents with multi-granularity text analysis.<n> FITI is able to provide accurate identification with average precision, recall, and F1-score values of 0.824, 0.646, and 0.716, respectively.
- Score: 10.684109842514772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The completeness (in terms of content) of financial documents is a fundamental requirement for investment funds. To ensure completeness, financial regulators have to spend significant time carefully checking every financial document based on relevant content requirements, which prescribe the information types to be included in financial documents (e.g., the description of shares' issue conditions and procedures). However, existing techniques provide limited support to help regulators automatically identify the text chunks related to financial information types, due to the complexity of financial documents. In this paper, we propose FITI to trace content requirements in financial documents with multi-granularity text analysis. Given a new financial document, FITI first selects a set of candidate sentences for efficient information type identification. Then, to rank candidate sentences, FITI uses a combination of rule-based and data-centric approaches, by leveraging information retrieval (IR) and machine learning (ML) techniques that analyze the words, sentences, and contexts related to an information type. Finally, a heuristic-based selector, which considers both the sentence ranking and domain-specific phrases, determines a list of sentences corresponding to each information type. We evaluated FITI by assessing its effectiveness in tracing financial content requirements in 100 real-world financial documents. Experimental results show that FITI is able to provide accurate identification with average precision, recall, and F1-score values of 0.824, 0.646, and 0.716, respectively. The overall accuracy of FITI significantly outperforms the best baseline (based on a transformer language model) by 0.266 in terms of F1-score. Furthermore, FITI can help regulators detect about 80% of missing information types in financial documents
Related papers
- FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation [63.55583665003167]
We present FinDER, an expert-generated dataset tailored for Retrieval-Augmented Generation (RAG) in finance.
FinDER focuses on annotating search-relevant evidence by domain experts, offering 5,703 query-evidence-answer triplets.
By challenging models to retrieve relevant information from large corpora, FinDER offers a more realistic benchmark for evaluating RAG systems.
arXiv Detail & Related papers (2025-04-22T11:30:13Z) - FinSage: A Multi-aspect RAG System for Financial Filings Question Answering [7.581619443736712]
FinSage is a multi-modal pre-processing pipeline that unifies diverse data formats and generates metadata summaries.
Experiments demonstrate that FinSage achieves an impressive recall of 92.51% on 75 expert-curated questions.
FinSage has been successfully deployed as financial question-answering agent in online meetings, where it has already served more than 1,200 people.
arXiv Detail & Related papers (2025-04-20T04:58:14Z) - FinTMMBench: Benchmarking Temporal-Aware Multi-Modal RAG in Finance [79.78247299859656]
FinTMMBench is the first comprehensive benchmark for evaluating temporal-aware multi-modal Retrieval-Augmented Generation systems in finance.<n>Built from heterologous data of NASDAQ 100 companies, FinTMMBench offers three significant advantages.
arXiv Detail & Related papers (2025-03-07T07:13:59Z) - Unified Multimodal Interleaved Document Representation for Retrieval [57.65409208879344]
We propose a method that holistically embeds documents interleaved with multiple modalities.<n>We merge the representations of segmented passages into one single document representation.<n>We show that our approach substantially outperforms relevant baselines.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z) - DocFinQA: A Long-Context Financial Reasoning Dataset [17.752081303855263]
We introduce a long-document financial QA task.
We extend the average context length from under 700 words in FinQA to 123k words in DocFinQA.
We conduct extensive experiments over retrieval-based QA pipelines and long-context language models.
arXiv Detail & Related papers (2024-01-12T22:19:22Z) - Conversational Factor Information Retrieval Model (ConFIRM) [2.855224352436985]
Conversational Factor Information Retrieval Method (ConFIRM) is a novel approach to fine-tuning large language models (LLMs) for domain-specific retrieval tasks.
We demonstrate ConFIRM's effectiveness through a case study in the finance sector, fine-tuning a Llama-2-7b model using personality-aligned data.
The resulting model achieved 91% accuracy in classifying financial queries, with an average inference time of 0.61 seconds on an NVIDIA A100 GPU.
arXiv Detail & Related papers (2023-10-06T12:31:05Z) - Fin-Fact: A Benchmark Dataset for Multimodal Financial Fact Checking and Explanation Generation [26.573578326262307]
Fin-Fact is a benchmark dataset for multimodal fact-checking within the financial domain.
It includes professional fact-checker annotations and justifications, providing expertise and credibility.
arXiv Detail & Related papers (2023-09-15T22:24:00Z) - PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark
for Finance [63.51545277822702]
PIXIU is a comprehensive framework including the first financial large language model (LLMs) based on fine-tuning LLaMA with instruction data.
We propose FinMA by fine-tuning LLaMA with the constructed dataset to be able to follow instructions for various financial tasks.
We conduct a detailed analysis of FinMA and several existing LLMs, uncovering their strengths and weaknesses in handling critical financial tasks.
arXiv Detail & Related papers (2023-06-08T14:20:29Z) - Enabling and Analyzing How to Efficiently Extract Information from
Hybrid Long Documents with LLMs [48.87627426640621]
This research focuses on harnessing the potential of Large Language Models to comprehend critical information from financial reports.
We propose an Automated Financial Information Extraction framework that enhances LLMs' ability to comprehend and extract information from financial reports.
Our framework is effectively validated on GPT-3.5 and GPT-4, yielding average accuracy increases of 53.94% and 33.77%, respectively.
arXiv Detail & Related papers (2023-05-24T10:35:58Z) - FETILDA: An Effective Framework For Fin-tuned Embeddings For Long
Financial Text Documents [14.269860621624394]
We propose and implement a deep learning framework that splits long documents into chunks and utilize pre-trained LMs to process and aggregate the chunks into vector representations.
We evaluate our framework on a collection of 10-K public disclosure reports from US banks, and another dataset of reports submitted by US companies.
arXiv Detail & Related papers (2022-06-14T16:14:14Z) - FinBERT-MRC: financial named entity recognition using BERT under the
machine reading comprehension paradigm [8.17576814961648]
We formulate the FinNER task as a machine reading comprehension (MRC) problem and propose a new model termed FinBERT-MRC.
This formulation introduces significant prior information by utilizing well-designed queries, and extracts start index and end index of target entities.
We conduct experiments on a publicly available Chinese financial dataset ChFinAnn and a real-word dataset AdminPunish.
arXiv Detail & Related papers (2022-05-31T00:44:57Z) - FinQA: A Dataset of Numerical Reasoning over Financial Data [52.7249610894623]
We focus on answering deep questions over financial data, aiming to automate the analysis of a large corpus of financial documents.
We propose a new large-scale dataset, FinQA, with Question-Answering pairs over Financial reports, written by financial experts.
The results demonstrate that popular, large, pre-trained models fall far short of expert humans in acquiring finance knowledge.
arXiv Detail & Related papers (2021-09-01T00:08:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.