FiNER: Financial Numeric Entity Recognition for XBRL Tagging
- URL: http://arxiv.org/abs/2203.06482v1
- Date: Sat, 12 Mar 2022 16:43:57 GMT
- Title: FiNER: Financial Numeric Entity Recognition for XBRL Tagging
- Authors: Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini
Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, Georgios Paliouras
- Abstract summary: We introduce XBRL tagging as a new entity extraction task for the financial domain.
We release FiNER-139, a dataset of 1.1M sentences with gold tags.
We show that subword fragmentation of numeric expressions harms BERT's performance.
- Score: 29.99876910165977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Publicly traded companies are required to submit periodic reports with
eXtensible Business Reporting Language (XBRL) word-level tags. Manually tagging
the reports is tedious and costly. We, therefore, introduce XBRL tagging as a
new entity extraction task for the financial domain and release FiNER-139, a
dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction
datasets, FiNER-139 uses a much larger label set of 139 entity types. Most
annotated tokens are numeric, with the correct tag per token depending mostly
on context, rather than the token itself. We show that subword fragmentation of
numeric expressions harms BERT's performance, allowing word-level BILSTMs to
perform better. To improve BERT's performance, we propose two simple and
effective solutions that replace numeric expressions with pseudo-tokens
reflecting original token shapes and numeric magnitudes. We also experiment
with FIN-BERT, an existing BERT model for the financial domain, and release our
own BERT (SEC-BERT), pre-trained on financial filings, which performs best.
Through data and error analysis, we finally identify possible limitations to
inspire future work on XBRL tagging.
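As a rough illustration of the pseudo-token idea described in the abstract, the sketch below replaces numeric tokens with either a shape pseudo-token or a magnitude pseudo-token before wordpiece tokenization; the exact pseudo-token formats ("[X,XXX.XX]", "[NUM6]") and the regular expression are illustrative assumptions, not the paper's actual vocabulary.

```python
import re

def shape_pseudo_token(token: str) -> str:
    """Replace every digit with 'X', keeping punctuation, so '1,234.56'
    becomes '[X,XXX.XX]' (hypothetical pseudo-token format)."""
    return "[" + re.sub(r"\d", "X", token) + "]"

def magnitude_pseudo_token(token: str) -> str:
    """Encode only the number of digits, so '1,234.56' becomes
    '[NUM6]' (hypothetical pseudo-token format)."""
    digits = sum(ch.isdigit() for ch in token)
    return f"[NUM{digits}]"

def preprocess(tokens):
    """Swap numeric tokens for pseudo-tokens before wordpiece tokenization,
    so BERT sees a single unfragmented token per numeric expression."""
    num_re = re.compile(r"^\d[\d.,]*$")
    return [shape_pseudo_token(t) if num_re.match(t) else t for t in tokens]

print(preprocess(["Revenue", "was", "$", "1,234.56", "million", "."]))
# ['Revenue', 'was', '$', '[X,XXX.XX]', 'million', '.']
```

In practice, such pseudo-tokens would also have to be added to the tokenizer's vocabulary so that they are not fragmented themselves.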
Related papers
- Parameter-Efficient Instruction Tuning of Large Language Models For Extreme Financial Numeral Labelling [29.84946857859386]
We study the problem of automatically annotating relevant numerals occurring in financial documents with their corresponding tags.
We propose a parameter-efficient solution for the task using LoRA.
Our proposed model, FLAN-FinXC, achieves new state-of-the-art performance on both datasets.
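A minimal sketch of the LoRA idea referenced above (a trainable low-rank update added to a frozen linear layer); the class, rank, and scaling below are illustrative assumptions rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, where only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weights
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)   # start as a no-op update
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(768, 768))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 12288 trainable
```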
arXiv Detail & Related papers (2024-05-03T16:41:36Z)
- Financial Numeric Extreme Labelling: A Dataset and Benchmarking for XBRL Tagging [23.01422165679548]
The U.S. Securities and Exchange Commission (SEC) mandates all public companies to file periodic financial statements that should contain numerals with a particular label from a taxonomy.
We formulate the task of assigning a label to a particular numeral span in a sentence, drawing from an extremely large label set.
arXiv Detail & Related papers (2023-06-06T14:41:30Z)
- GPT-NER: Named Entity Recognition via Large Language Models [58.609582116612934]
GPT-NER transforms the sequence labeling task into a generation task that can be easily handled by large language models.
We find that GPT-NER exhibits greater ability in low-resource and few-shot setups, where the amount of training data is extremely scarce.
This demonstrates the capabilities of GPT-NER in real-world NER applications where the number of labeled examples is limited.
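A rough sketch of the generation-style formulation described above: prompt the model to rewrite the sentence with entity spans wrapped in markers, then parse the markers back out. The marker symbols and prompt wording here are assumptions for illustration.

```python
import re

def build_prompt(sentence: str, entity_type: str) -> str:
    # Ask the model to rewrite the sentence, wrapping entities of one type in markers.
    return (
        f"Mark every {entity_type} entity in the sentence by surrounding it "
        f"with @@ and ##.\nSentence: {sentence}\nOutput:"
    )

def parse_entities(generated: str):
    # Recover entity spans from the marked-up output.
    return re.findall(r"@@(.+?)##", generated)

output = "Net income of @@Tesla## rose in @@Q3 2021##."
print(parse_entities(output))  # ['Tesla', 'Q3 2021']
```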
arXiv Detail & Related papers (2023-04-20T16:17:26Z)
- German BERT Model for Legal Named Entity Recognition [0.43461794560295636]
We fine-tune a popular BERT language model trained on German data (German BERT) on a Legal Entity Recognition (LER) dataset.
The results we achieve by fine-tuning German BERT on the LER dataset outperform the BiLSTM-CRF+ model used by the authors of the same LER dataset.
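A minimal sketch of fine-tuning a German BERT checkpoint for token-level legal NER with Hugging Face Transformers; the checkpoint name and label count are placeholders, not necessarily those used in the paper.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Checkpoint and label count are placeholders; the paper fine-tunes German BERT
# on a legal NER dataset with its own tag set.
model_name = "bert-base-german-cased"
num_labels = 19

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=num_labels)

tokens = tokenizer("Das Urteil des Bundesgerichtshofs wurde aufgehoben.", return_tensors="pt")
logits = model(**tokens).logits            # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)        # one label id per wordpiece
```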
arXiv Detail & Related papers (2023-03-07T11:54:39Z)
- FinBERT-MRC: financial named entity recognition using BERT under the machine reading comprehension paradigm [8.17576814961648]
We formulate the FinNER task as a machine reading comprehension (MRC) problem and propose a new model termed FinBERT-MRC.
This formulation introduces significant prior information by utilizing well-designed queries, and extracts the start and end indices of target entities.
We conduct experiments on a publicly available Chinese financial dataset, ChFinAnn, and a real-world dataset, AdminPunish.
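A minimal sketch of the MRC-style formulation: encode a query together with the passage and predict start/end logits over the tokens. The encoder checkpoint, query wording, and prediction heads are illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class MRCSpanExtractor(nn.Module):
    """Encode '[CLS] query [SEP] passage [SEP]' and predict start/end logits
    per token (illustrative sketch, not FinBERT-MRC's exact design)."""
    def __init__(self, encoder_name: str = "bert-base-chinese"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.start_head = nn.Linear(hidden, 1)
        self.end_head = nn.Linear(hidden, 1)

    def forward(self, **inputs):
        h = self.encoder(**inputs).last_hidden_state           # (batch, seq_len, hidden)
        return self.start_head(h).squeeze(-1), self.end_head(h).squeeze(-1)

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = MRCSpanExtractor()
batch = tok("What is the company name?", "公司名称为华为技术有限公司。", return_tensors="pt")
start_logits, end_logits = model(**batch)   # query positions would be masked out in practice
```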
arXiv Detail & Related papers (2022-05-31T00:44:57Z)
- MarkBERT: Marking Word Boundaries Improves Chinese BERT [67.53732128091747]
MarkBERT keeps the vocabulary as Chinese characters and inserts boundary markers between contiguous words.
Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
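A small sketch of the boundary-marker idea: keep character-level tokens and insert a special marker token between segmented words. The marker name used here is an assumption, not necessarily the token MarkBERT uses.

```python
def insert_boundary_markers(words, marker="[unused1]"):
    """Keep Chinese characters as tokens and add a marker token between
    consecutive words (marker name is an assumption)."""
    out = []
    for i, word in enumerate(words):
        out.extend(list(word))          # character-level tokens
        if i < len(words) - 1:
            out.append(marker)          # mark the word boundary
    return out

# After word segmentation of "华为发布新手机" into words:
print(insert_boundary_markers(["华为", "发布", "新", "手机"]))
# ['华', '为', '[unused1]', '发', '布', '[unused1]', '新', '[unused1]', '手', '机']
```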
arXiv Detail & Related papers (2022-03-12T08:43:06Z)
- Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words [50.11559460111882]
We explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze test and machine reading comprehension.
Since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets.
arXiv Detail & Related papers (2022-02-24T15:15:48Z)
- Lex-BERT: Enhancing BERT based NER with lexicons [1.6884834576352221]
We present Lex-BERT, which incorporates lexicon information into Chinese BERT for named entity recognition tasks.
Our model does not introduce any new parameters and is more efficient than FLAT.
arXiv Detail & Related papers (2021-01-02T07:43:21Z)
- Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, considering the bidirectional and conditionally independent nature of BERT.
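A minimal sketch of the kind of bottleneck adapter module referenced above (down-projection, non-linearity, up-projection, residual connection); the sizes and placement are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter trained while the surrounding BERT
    encoder/decoder weights stay frozen (sizes are illustrative)."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection keeps the original representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter()
x = torch.randn(2, 16, 768)       # (batch, seq_len, hidden)
print(adapter(x).shape)           # torch.Size([2, 16, 768])
```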
arXiv Detail & Related papers (2020-10-13T03:25:15Z)
- Incorporating BERT into Neural Machine Translation [251.54280200353674]
We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence.
We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets.
arXiv Detail & Related papers (2020-02-17T08:13:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.