Domain-Adaptive Small Language Models for Structured Tax Code Prediction
- URL: http://arxiv.org/abs/2507.10880v2
- Date: Sat, 19 Jul 2025 21:12:12 GMT
- Title: Domain-Adaptive Small Language Models for Structured Tax Code Prediction
- Authors: Souvik Nath, Sumit Wadhwa, Luis Perez
- Abstract summary: This paper proposes a domain-adaptive small language model (SLM) with an encoder-decoder architecture for the enhanced prediction of product and service tax codes. We employ an SLM based on an encoder-decoder architecture because this enables sequential generation of tax codes. Our experiments demonstrate that encoder-decoder SLMs can be successfully applied to the sequential prediction of structured tax codes.
- Score: 0.05783229039119002
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Every day, multinational firms process thousands of transactions, each of which must adhere to tax regulations that vary by jurisdiction and are often nuanced. The determination of product and service tax codes, such as HSN or SAC, is a major use case in tax compliance. An accurate determination of such codes is imperative to avoid tax penalties. This paper proposes a domain-adaptive small language model (SLM) with an encoder-decoder architecture for the enhanced prediction of product and service tax codes. In this approach, we address the problem of predicting hierarchical tax code sequences from unstructured product and service data. We employ an SLM based on an encoder-decoder architecture because this enables sequential generation of tax codes, capturing the hierarchical dependencies present within them. Our experiments demonstrate that encoder-decoder SLMs can be successfully applied to the sequential prediction of structured tax codes, a domain that remains comparatively unexplored in current NLP research. In this paper, we demonstrate the superior performance of domain-adaptive encoder-decoder SLMs over flat classifiers when applied to the Harmonized System of Nomenclature (HSN), and achieve superior results compared to decoder-only and encoder-only architectures for structured sequence generation tasks. This approach can also be scaled to other government-mandated tax commodity codes, such as United Nations Standard Products and Services Codes (UNSPSC) or Brazil's Nomenclatura Comum do Mercosul (NCM).
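As a minimal illustration of why sequential generation captures the hierarchy in codes like HSN (chapter, then heading, then subheading), the toy greedy decoder below generates one digit group at a time, each conditioned on the prefix so far. The conditional probabilities are made up; this is not the paper's model, only a sketch of the decoding scheme it relies on.

```python
# Toy sketch: sequential generation of an HSN code as a hierarchy of
# digit groups -- chapter -> heading -> subheading. Each step conditions
# on the prefix generated so far, which is what an encoder-decoder SLM
# captures and a flat classifier over full codes cannot.

# Hypothetical conditional distributions, standing in for what an
# autoregressive decoder would produce from an encoded product description.
NEXT_GROUP_PROBS = {
    ():           {"85": 0.7, "84": 0.3},   # chapter
    ("85",):      {"17": 0.6, "44": 0.4},   # heading, given chapter
    ("85", "17"): {"12": 0.8, "62": 0.2},   # subheading, given prefix
    ("84",):      {"71": 1.0},
    ("84", "71"): {"30": 1.0},
}

def decode_hsn(probs, depth=3):
    """Greedy sequential decoding of one HSN code, group by group."""
    prefix = ()
    for _ in range(depth):
        dist = probs[prefix]
        best = max(dist, key=dist.get)  # greedy choice at this level
        prefix = prefix + (best,)
    return "".join(prefix)

code = decode_hsn(NEXT_GROUP_PROBS)
print(code)  # -> "851712" for these toy probabilities
```

Because each level is chosen given its parent prefix, an implausible chapter prunes all of its headings at once, whereas a flat classifier must discriminate among all full codes jointly.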
Related papers
- Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation [36.166087396386445]
We present Compliance-to-Code, the first large-scale Chinese dataset dedicated to financial regulatory compliance. Covering 1,159 annotated clauses from 361 regulations across ten categories, each clause is modularly structured with four logical elements (subject, condition, constraint, and contextual information) along with regulation relations. We provide deterministic Python code mappings, detailed code reasoning, and code explanations to facilitate automated auditing.
arXiv Detail & Related papers (2025-05-26T10:38:32Z)
- Technical Challenges in Maintaining Tax Prep Software with Large Language Models [6.419602857618507]
Our research efforts focus on identifying, understanding, and tackling technical challenges in leveraging Large Language Models (LLMs) such as ChatGPT and Llama to faithfully extract code differentials from IRS publications.
arXiv Detail & Related papers (2025-04-25T21:00:20Z)
- CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation [69.684886175768]
Large language models (LLMs) have shown promising performance in automated code generation. In this paper, we propose CodeRAG, a retrieval-augmented code generation framework. Experiments show that CodeRAG achieves significant improvements compared to no-RAG scenarios.
arXiv Detail & Related papers (2025-04-14T09:51:23Z)
- Learnable Item Tokenization for Generative Recommendation [78.30417863309061]
We propose LETTER (a LEarnable Tokenizer for generaTivE Recommendation), which integrates hierarchical semantics, collaborative signals, and code assignment diversity.
LETTER incorporates Residual Quantized VAE for semantic regularization, a contrastive alignment loss for collaborative regularization, and a diversity loss to mitigate code assignment bias.
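The Residual Quantized VAE at the core of LETTER produces hierarchical code sequences by quantizing, at each level, the residual left by the previous level. A minimal sketch of that residual quantization step, with made-up 2-D embeddings and tiny codebooks (not LETTER's learned ones):

```python
# Minimal sketch of residual quantization, the mechanism behind the
# RQ-VAE used in LETTER. Codebooks and vectors here are hypothetical.
# Each level quantizes the residual of the previous level, so the code
# sequence is hierarchical: coarse assignment first, then refinements.

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    dists = [sum((c - v) ** 2 for c, v in zip(entry, vec)) for entry in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

def residual_quantize(vec, codebooks):
    """Return one code index per level; each level encodes the residual."""
    codes, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

# Two levels with tiny toy codebooks over 2-D embeddings.
codebooks = [
    [(1.0, 0.0), (0.0, 1.0)],               # coarse level
    [(0.1, 0.0), (0.0, 0.1), (0.0, 0.0)],   # refinement level
]
print(residual_quantize((1.08, 0.02), codebooks))  # -> [0, 0]
```

In the full model the codebooks are learned jointly with the encoder, and the resulting index sequences serve as item tokens for the generative recommender.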
arXiv Detail & Related papers (2024-05-12T15:49:38Z)
- A Novel ICD Coding Method Based on Associated and Hierarchical Code Description Distillation [6.524062529847299]
ICD coding is a challenging multilabel text classification problem due to noisy medical document inputs.
Recent advancements in automated ICD coding have enhanced performance by integrating additional data and knowledge bases with the encoding of medical notes and codes.
We propose a novel framework based on associated and hierarchical code description distillation (AHDD) for better code representation learning and avoidance of improper code assignment.
arXiv Detail & Related papers (2024-04-17T07:26:23Z)
- On the Potential and Limitations of Few-Shot In-Context Learning to Generate Metamorphic Specifications for Tax Preparation Software [12.071874385139395]
Nearly 50% of taxpayers filed their individual income taxes using tax software in the U.S. in FY22.
This paper formulates the task of generating metamorphic specifications as a translation task between properties extracted from tax documents.
arXiv Detail & Related papers (2023-11-20T18:12:28Z)
- Machine Learning-Aided Efficient Decoding of Reed-Muller Subcodes [59.55193427277134]
Reed-Muller (RM) codes achieve the capacity of general binary-input memoryless symmetric channels.
RM codes only admit limited sets of rates.
Efficient decoders are available for RM codes at finite lengths.
arXiv Detail & Related papers (2023-01-16T04:11:14Z)
- Metamorphic Testing and Debugging of Tax Preparation Software [2.185694185279913]
We focus on an open-source tax preparation software for our case study.
We develop a randomized test-case generation strategy to systematically validate the correctness of tax preparation software.
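The idea behind randomized metamorphic testing of tax software can be sketched with a toy tax function (hypothetical brackets, not the open-source system the paper studies) and one metamorphic relation: with everything else fixed, raising income must never lower the tax owed. Randomized inputs probe the relation without requiring a ground-truth oracle for any individual return.

```python
import random

# Toy two-bracket tax computation standing in for real tax software.
def tax_owed(income):
    """10% on income up to 10,000; 20% on the remainder above it."""
    if income <= 10_000:
        return 0.10 * income
    return 0.10 * 10_000 + 0.20 * (income - 10_000)

def check_monotonic_relation(trials=1_000, seed=0):
    """Randomized metamorphic test: more income -> tax owed never drops."""
    rng = random.Random(seed)
    for _ in range(trials):
        income = rng.uniform(0, 100_000)
        bump = rng.uniform(0, 5_000)
        assert tax_owed(income + bump) >= tax_owed(income)
    return True

print(check_monotonic_relation())  # -> True
```

A buggy bracket boundary (e.g., applying the higher rate to the whole income) would violate the relation for inputs straddling the boundary, which the random search is likely to hit.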
arXiv Detail & Related papers (2022-05-10T16:10:10Z)
- Who Should Go First? A Self-Supervised Concept Sorting Model for Improving Taxonomy Expansion [50.794640012673064]
As data and business scope grow in real applications, existing taxonomies need to be expanded to incorporate new concepts.
Previous works on taxonomy expansion process the new concepts independently and simultaneously, ignoring the potential relationships among them and the appropriate order of inserting operations.
We propose TaxoOrder, a novel self-supervised framework that simultaneously discovers the local hypernym-hyponym structure among new concepts and decides the order of insertion.
arXiv Detail & Related papers (2021-04-08T11:00:43Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
- TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced Graph Neural Network [62.12557274257303]
Taxonomies consist of machine-interpretable semantics and provide valuable knowledge for many web applications.
We propose a novel self-supervised framework, named TaxoExpan, which automatically generates a set of <query concept, anchor concept> pairs from the existing taxonomy as training data.
We develop two innovative techniques in TaxoExpan: (1) a position-enhanced graph neural network that encodes the local structure of an anchor concept in the existing taxonomy, and (2) a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data.
arXiv Detail & Related papers (2020-01-26T21:30:21Z)
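TaxoExpan's self-supervision can be sketched in its simplest form: mine <query concept, anchor concept> pairs from an existing taxonomy by treating each node as a query and its parent as the anchor. The toy taxonomy below is made up, and this omits the position-enhanced GNN that encodes each anchor's local structure in the full framework.

```python
# Toy taxonomy as a child -> parent map (hypothetical concepts).
TAXONOMY = {
    "laptop": "computer",
    "desktop": "computer",
    "computer": "electronics",
    "phone": "electronics",
}

def mine_training_pairs(taxonomy):
    """Each existing edge yields one (query, anchor) positive pair,
    giving free supervision without any manual labeling."""
    return sorted((child, parent) for child, parent in taxonomy.items())

pairs = mine_training_pairs(TAXONOMY)
print(pairs)
```

At expansion time, a genuinely new concept plays the query role and the trained model scores candidate anchors in the existing taxonomy.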
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.