Taxon: Hierarchical Tax Code Prediction with Semantically Aligned LLM Expert Guidance
- URL: http://arxiv.org/abs/2601.08418v1
- Date: Tue, 13 Jan 2026 10:41:23 GMT
- Title: Taxon: Hierarchical Tax Code Prediction with Semantically Aligned LLM Expert Guidance
- Authors: Jihang Li, Qing Liu, Zulong Chen, Jing Wang, Wei Wang, Chuanfei Xu, Zeyi Wen
- Abstract summary: Taxon is a semantically aligned and expert-guided framework for hierarchical tax code prediction. Taxon has been deployed in production within Alibaba's tax service system.
- Score: 17.32251921642481
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Tax code prediction is a crucial yet underexplored task in automating invoicing and compliance management for large-scale e-commerce platforms. Each product must be accurately mapped to a node within a multi-level taxonomic hierarchy defined by national standards, where errors lead to financial inconsistencies and regulatory risks. This paper presents Taxon, a semantically aligned and expert-guided framework for hierarchical tax code prediction. Taxon integrates (i) a feature-gating mixture-of-experts architecture that adaptively routes multi-modal features across taxonomy levels, and (ii) a semantic consistency model distilled from large language models acting as domain experts to verify alignment between product titles and official tax definitions. To address noisy supervision in real business records, we design a multi-source training pipeline that combines curated tax databases, invoice validation logs, and merchant registration data to provide both structural and semantic supervision. Extensive experiments on the proprietary TaxCode dataset and public benchmarks demonstrate that Taxon achieves state-of-the-art performance, outperforming strong baselines. Further, an additional full hierarchical path reconstruction procedure significantly improves structural consistency, yielding the highest overall F1 scores. Taxon has been deployed in production within Alibaba's tax service system, handling an average of over 500,000 tax code queries per day and reaching peak volumes above five million requests during business events, with improved accuracy, interpretability, and robustness.
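The abstract names the feature-gating mixture-of-experts only at a high level. As a rough illustration of the idea (not the paper's actual architecture), a per-level gate can mix expert heads over a shared feature vector; all weights, shapes, and function names below are hypothetical:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def feature_gated_moe(features, gate_weights, expert_weights):
    """Per-level prediction: a gate mixes expert heads for each taxonomy level.

    features:       product feature vector of length d
    gate_weights:   per level, one weight vector (length d) per expert
    expert_weights: per level, one head per expert, each head holding one
                    weight vector (length d) per class at that level
    Returns one probability distribution over classes per taxonomy level.
    """
    level_probs = []
    for gate_w, experts in zip(gate_weights, expert_weights):
        # Gate scores decide how much each expert contributes at this level.
        gates = softmax([dot(w, features) for w in gate_w])
        num_classes = len(experts[0])
        # Mix expert logits, then normalize into class probabilities.
        logits = [
            sum(g * dot(experts[e][c], features) for e, g in enumerate(gates))
            for c in range(num_classes)
        ]
        level_probs.append(softmax(logits))
    return level_probs
```

In a full hierarchical predictor, each level's distribution would additionally be constrained by the parent chosen at the level above; that step is omitted here.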
Related papers
- Information Extraction From Fiscal Documents Using LLMs [0.44641493866640386]
We present a novel approach to extracting structured data from multi-page government fiscal documents. Our method achieves high accuracy through a multi-stage pipeline that leverages domain knowledge, sequential context, and algorithmic validation. Our implementation shows promise for broader applications across developing country contexts.
arXiv Detail & Related papers (2025-11-03T19:17:49Z) - Domain-Adaptive Small Language Models for Structured Tax Code Prediction [0.05783229039119002]
This paper proposes a domain-adaptive small language model (SLM) with an encoder-decoder architecture for the enhanced prediction of product and service tax codes. We employ an SLM with an encoder-decoder architecture, as this enables sequential generation of tax codes. Our experiments demonstrate that encoder-decoder SLMs can be successfully applied to the sequential prediction of structured tax codes.
arXiv Detail & Related papers (2025-07-15T00:46:01Z) - CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts [40.52605902842168]
Taxonomies play a crucial role in various applications by providing a structural representation of knowledge. Previous approaches typically relied on self-supervised methods that generate annotation data from the existing taxonomy. We introduce CodeTaxo, a novel approach that leverages large language models through code language prompts to capture the taxonomic structure.
arXiv Detail & Related papers (2024-08-17T02:15:07Z) - A Taxation Perspective for Fair Re-ranking [61.946428892727795]
We introduce a new fair re-ranking method named Tax-rank, which levies taxes based on the difference in utility between two items.
Our model Tax-rank offers a superior tax policy for fair re-ranking, theoretically demonstrating both continuity and controllability over accuracy loss.
arXiv Detail & Related papers (2024-04-27T08:21:29Z) - On the Potential and Limitations of Few-Shot In-Context Learning to
Generate Metamorphic Specifications for Tax Preparation Software [12.071874385139395]
Nearly 50% of taxpayers filed their individual income taxes using tax software in the U.S. in FY22.
This paper formulates the task of generating metamorphic specifications as a translation task between properties extracted from tax documents.
arXiv Detail & Related papers (2023-11-20T18:12:28Z) - Insert or Attach: Taxonomy Completion via Box Embedding [75.69894194912595]
Previous approaches embed concepts as vectors in Euclidean space, which makes it difficult to model asymmetric relations in taxonomy.
We develop a framework, TaxBox, that leverages box containment and center closeness to design two specialized geometric scorers within the box embedding space.
These scorers are tailored for insertion and attachment operations and can effectively capture intrinsic relationships between concepts.
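The TaxBox summary describes box containment and center closeness without giving formulas. A minimal sketch of what such geometric scorers could look like, assuming axis-aligned boxes stored as (min-corner, max-corner) pairs (the paper's actual scoring functions may differ):

```python
import math

def box_volume(lo, hi):
    """Volume of an axis-aligned box; zero if any side is degenerate."""
    v = 1.0
    for a, b in zip(lo, hi):
        v *= max(b - a, 0.0)
    return v

def containment_score(child, parent):
    """Fraction of the child box's volume that lies inside the parent box.

    Returns 1.0 when the child is fully contained, 0.0 when disjoint;
    useful as an asymmetric hypernym signal for insertion decisions.
    """
    (clo, chi), (plo, phi) = child, parent
    ilo = [max(a, b) for a, b in zip(clo, plo)]
    ihi = [min(a, b) for a, b in zip(chi, phi)]
    cv = box_volume(clo, chi)
    return box_volume(ilo, ihi) / cv if cv > 0 else 0.0

def center_closeness(box_a, box_b):
    """Negative Euclidean distance between box centers (higher = closer)."""
    ca = [(lo + hi) / 2 for lo, hi in zip(*box_a)]
    cb = [(lo + hi) / 2 for lo, hi in zip(*box_b)]
    return -math.dist(ca, cb)
```

Containment is asymmetric (child-in-parent differs from parent-in-child), which is exactly what vector dot products in Euclidean space fail to capture.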
arXiv Detail & Related papers (2023-05-18T14:34:58Z) - TaxoEnrich: Self-Supervised Taxonomy Completion via Structure-Semantic
Representations [28.65753036636082]
We propose a new taxonomy completion framework, which effectively leverages both semantic features and structural information in the existing taxonomy.
TaxoEnrich consists of four components: (1) taxonomy-contextualized embedding which incorporates both semantic meanings of concept and taxonomic relations based on powerful pretrained language models; (2) a taxonomy-aware sequential encoder which learns candidate position representations by encoding the structural information of taxonomy.
Experiments on four large real-world datasets from different domains show that TaxoEnrich achieves the best performance among all evaluation metrics and outperforms previous state-of-the-art by a large margin.
arXiv Detail & Related papers (2022-02-10T08:10:43Z) - Who Should Go First? A Self-Supervised Concept Sorting Model for
Improving Taxonomy Expansion [50.794640012673064]
As data and business scope grow in real applications, existing taxonomies need to be expanded to incorporate new concepts.
Previous works on taxonomy expansion process the new concepts independently and simultaneously, ignoring the potential relationships among them and the appropriate order of inserting operations.
We propose TaxoOrder, a novel self-supervised framework that simultaneously discovers the local hypernym-hyponym structure among new concepts and decides the order of insertion.
arXiv Detail & Related papers (2021-04-08T11:00:43Z) - Octet: Online Catalog Taxonomy Enrichment with Self-Supervision [67.26804972901952]
We present Octet, a self-supervised end-to-end framework for Online Catalog Taxonomy EnrichmenT.
We propose to train a sequence labeling model for term extraction and employ graph neural networks (GNNs) to capture the taxonomy structure.
Octet enriches an online catalog in production to twice its original size in the open-world evaluation.
arXiv Detail & Related papers (2020-06-18T04:53:07Z) - STEAM: Self-Supervised Taxonomy Expansion with Mini-Paths [53.45704816829921]
We propose a self-supervised taxonomy expansion model named STEAM.
STEAM generates natural self-supervision signals, and formulates a node attachment prediction task.
Experiments show STEAM outperforms state-of-the-art methods for taxonomy expansion by 11.6% in accuracy and 7.0% in mean reciprocal rank.
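The STEAM summary reports gains in accuracy and mean reciprocal rank. For readers unfamiliar with the latter, MRR averages the reciprocal rank of the true attachment position across queries; a small self-contained sketch (the candidate names are illustrative):

```python
def mean_reciprocal_rank(ranked_lists, gold):
    """MRR over queries.

    ranked_lists[i]: the model's ranking of candidate positions for query i
    gold[i]:         the true position for query i
    A query whose gold position is missing from the ranking contributes 0.
    """
    total = 0.0
    for ranking, g in zip(ranked_lists, gold):
        if g in ranking:
            total += 1.0 / (ranking.index(g) + 1)
    return total / len(gold)
```

Ranking the true position first gives 1.0, second gives 0.5, and so on, so MRR rewards models that place the correct attachment node near the top even when top-1 accuracy misses.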
arXiv Detail & Related papers (2020-06-18T00:32:53Z) - TaxoExpan: Self-supervised Taxonomy Expansion with Position-Enhanced
Graph Neural Network [62.12557274257303]
Taxonomies consist of machine-interpretable semantics and provide valuable knowledge for many web applications.
We propose a novel self-supervised framework, named TaxoExpan, which automatically generates a set of ⟨query concept, anchor concept⟩ pairs from the existing taxonomy as training data.
We develop two innovative techniques in TaxoExpan: (1) a position-enhanced graph neural network that encodes the local structure of an anchor concept in the existing taxonomy, and (2) a noise-robust training objective that enables the learned model to be insensitive to the label noise in the self-supervision data.
arXiv Detail & Related papers (2020-01-26T21:30:21Z)
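TaxoExpan's exact noise-robust objective is not reproduced in this summary. Purely as an illustrative stand-in, one standard recipe for tolerating label noise in self-supervised pairs is a soft-bootstrapped cross-entropy that mixes the given (possibly noisy) label with the model's own prediction:

```python
import math

def bootstrapped_ce(probs, label_idx, beta=0.8):
    """Soft-bootstrapping cross-entropy (in the style of Reed et al., 2015).

    probs:     the model's predicted class distribution
    label_idx: index of the (possibly noisy) supervision label
    beta:      trust placed in the label; beta=1.0 recovers standard CE,
               lower beta lets confident model beliefs down-weight bad labels.
    """
    eps = 1e-12  # guards log(0)
    loss = 0.0
    for i, p in enumerate(probs):
        one_hot = 1.0 if i == label_idx else 0.0
        target = beta * one_hot + (1.0 - beta) * p
        loss -= target * math.log(p + eps)
    return loss
```

With `beta=1.0` this is ordinary cross-entropy against the label; lowering `beta` blends in the model's own distribution as a soft target, which dampens the gradient from mislabeled self-supervision pairs.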
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.