Related papers: Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus

Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus

URL: http://arxiv.org/abs/2404.08664v1
Date: Fri, 29 Mar 2024 13:15:46 GMT
Title: Identifying Banking Transaction Descriptions via Support Vector Machine Short-Text Classification Based on a Specialized Labelled Corpus
Authors: Silvia García-Méndez, Milagros Fernández-Gavilanes, Jonathan Juncal-Martínez, Francisco J. González-Castaño, Oscar Barba Seara,
Abstract summary: We describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. We present a use case with a personal finance application, CoinScrap, which is available at Google Play and App Store.
Score: 7.046417074932257
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Short texts are omnipresent in real-time news, social network commentaries, etc. Traditional text representation methods have been successfully applied to self-contained documents of medium size. However, information in short texts is often insufficient, due, for example, to the use of mnemonics, which makes them hard to classify. Therefore, the particularities of specific domains must be exploited. In this article we describe a novel system that combines Natural Language Processing techniques with Machine Learning algorithms to classify banking transaction descriptions for personal finance management, a problem that was not previously considered in the literature. We trained and tested that system on a labelled dataset with real customer transactions that will be available to other researchers on request. Motivated by existing solutions in spam detection, we also propose a short text similarity detector to reduce training set size based on the Jaccard distance. Experimental results with a two-stage classifier combining this detector with a SVM indicate a high accuracy in comparison with alternative approaches, taking into account complexity and computing time. Finally, we present a use case with a personal finance application, CoinScrap, which is available at Google Play and App Store.

Related papers

Normalisation of SWIFT Message Counterparties with Feature Extraction and Clustering [0.0]
We propose a hybrid string similarity, topic modelling, hierarchical clustering and rule-based pipeline to facilitate clustering of transaction counterparties.<n>The approach retains most of the interpretability found in rule-based systems, as the former adds an additional level of cluster refinement to the latter.<n>When only a subset of the population needs to be investigated, such as in sanctions investigations, the approach allows for better control of the risks of missing entity variations.
arXiv Detail & Related papers (2025-08-24T12:41:44Z)
FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering [57.18367828883773]
FinAgentBench is a benchmark for evaluating agentic retrieval with multi-step reasoning in finance.<n>The benchmark consists of 26K expert-annotated examples on S&P-500 listed firms.<n>We evaluate a suite of state-of-the-art models and demonstrate how targeted fine-tuning can significantly improve agentic retrieval performance.
arXiv Detail & Related papers (2025-08-07T22:15:22Z)
Word Embedding Techniques for Classification of Star Ratings [0.0]
This research uses a novel dataset of telecom customers' reviews to perform an extensive study showing how different word embedding algorithms can affect the text classification process. Several state-of-the-art word embedding techniques are considered, including BERT, Word2Vec and Doc2Vec, coupled with several classification algorithms.
arXiv Detail & Related papers (2025-04-18T12:26:28Z)
A Context-Driven Training-Free Network for Lightweight Scene Text Segmentation and Recognition [32.142713322062306]
Text recognition systems often depend on large end-to-end architectures that require extensive training and are prohibitively expensive for real-time scenarios. We propose a training-free plug-and-play framework that leverages the strengths of pre-trained text recognizers while minimizing redundant computations. Our approach uses context-based understanding and introduces an attention-based segmentation stage, which refines candidate text regions at the pixel level.
arXiv Detail & Related papers (2025-03-19T18:51:01Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL) GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System [47.13932723910289]
We introduce an open-source multi-modal automated academic paper interpretation system (MMAPIS) with three-step process stages. It employs the hybrid modality preprocessing and alignment module to extract plain text, and tables or figures from documents separately. It then aligns this information based on the section names they belong to, ensuring that data with identical section names are categorized under the same section. It utilizes the extracted section names to divide the article into shorter text segments, facilitating specific summarizations both within and between sections via LLMs.
arXiv Detail & Related papers (2024-01-17T11:50:53Z)
MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating few-shot text classification task into text pair relevance estimation task. We conduct experiments on three widely used text classification datasets across four few-shot settings. Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
arXiv Detail & Related papers (2023-06-15T06:51:35Z)
TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
Scalable and Weakly Supervised Bank Transaction Classification [0.0]
This paper aims to categorize bank transactions using weak supervision, natural language processing, and deep neural network training. We present an effective and scalable end-to-end data pipeline, including data preprocessing, transaction text embedding, anchoring, label generation, discriminative neural network training.
arXiv Detail & Related papers (2023-05-28T23:12:12Z)
Actively Discovering New Slots for Task-oriented Conversation [19.815466126158785]
We propose a general new slot task in an information extraction fashion to realize human-in-the-loop learning. We leverage existing language tools to extract value candidates where the corresponding labels are leveraged as weak supervision signals. We conduct extensive experiments on several public datasets and compare with a bunch of competitive baselines to demonstrate our method.
arXiv Detail & Related papers (2023-05-06T13:33:33Z)
A pipeline and comparative study of 12 machine learning models for text classification [0.0]
Text-based communication is highly favoured as a communication method, especially in business environments. Many machine learning methods for text classification have been proposed and incorporated into the services of most email providers. However, optimising text classification algorithms and finding the right tradeoff on their aggressiveness is still a major research problem.
arXiv Detail & Related papers (2022-04-04T23:51:22Z)
Multi-class Text Classification using BERT-based Active Learning [4.028503203417233]
Classifying customer transactions into multiple categories helps understand the market needs for different customer segments. BERT-based models have proven to perform well in Natural Language Understanding. We benchmark the performance of BERT across different Active Learning strategies in Multi-Class Text Classification.
arXiv Detail & Related papers (2021-04-27T19:49:39Z)
Conditioned Text Generation with Transfer for Closed-Domain Dialogue Systems [65.48663492703557]
We show how to optimally train and control the generation of intent-specific sentences using a conditional variational autoencoder. We introduce a new protocol called query transfer that allows to leverage a large unlabelled dataset.
arXiv Detail & Related papers (2020-11-03T14:06:10Z)
Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports [66.39150945184683]
We focus on the problem of automatically identifying the main themes in a safeguarding report using supervised classification approaches. Our results show the potential of deep learning models to simulate subject-expert behaviour even for complex tasks with limited labelled data.
arXiv Detail & Related papers (2020-10-27T19:48:23Z)
TextScanner: Reading Characters in Order for Robust Scene Text Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition. It generates pixel-wise, multi-channel segmentation maps for character class, position and order. It also adopts RNN for context modeling and performs paralleled prediction for character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.