Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content
- URL: http://arxiv.org/abs/2601.05103v1
- Date: Thu, 08 Jan 2026 16:48:36 GMT
- Title: Semantically Orthogonal Framework for Citation Classification: Disentangling Intent and Content
- Authors: Changxu Duan, Zhiyin Tan
- Abstract summary: SOFT is a Semantically Orthogonal Framework with Two dimensions that explicitly separates citation intent from cited content type. We re-annotate the ACL-ARC dataset using SOFT and release a cross-disciplinary test set sampled from ACT2. Results confirm SOFT's value as a clear, reusable annotation standard, improving clarity, consistency, and generalizability for digital libraries and scholarly communication infrastructures.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Understanding the role of citations is essential for research assessment and citation-aware digital libraries. However, existing citation classification frameworks often conflate citation intent (why a work is cited) with cited content type (what part is cited), limiting their effectiveness in automatic classification due to a dilemma between fine-grained type distinctions and practical classification reliability. We introduce SOFT, a Semantically Orthogonal Framework with Two dimensions that explicitly separates citation intent from cited content type, drawing inspiration from semantic role theory. We systematically re-annotate the ACL-ARC dataset using SOFT and release a cross-disciplinary test set sampled from ACT2. Evaluation with both zero-shot and fine-tuned Large Language Models demonstrates that SOFT enables higher agreement between human annotators and LLMs, and supports stronger classification performance and robust cross-domain generalization compared to the ACL-ARC and SciCite annotation frameworks. These results confirm SOFT's value as a clear, reusable annotation standard, improving clarity, consistency, and generalizability for digital libraries and scholarly communication infrastructures. All code and data are publicly available on GitHub: https://github.com/zhiyintan/SOFT.
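The core idea of the abstract is that each citation gets two independent labels rather than one. A minimal sketch of what such a two-dimensional annotation record might look like is below; the label names here are illustrative placeholders, not SOFT's actual category inventory (which is defined in the paper and repository).

```python
from dataclasses import dataclass

# Hypothetical label sets; SOFT's real categories may differ.
INTENT_LABELS = {"background", "method_use", "comparison"}    # why cited
CONTENT_LABELS = {"finding", "method", "dataset"}             # what is cited

@dataclass(frozen=True)
class CitationAnnotation:
    """A citation labeled on two orthogonal axes, as the framework
    proposes: intent (why the work is cited) and content type
    (what part of it is cited)."""
    context: str
    intent: str
    content: str

    def __post_init__(self):
        # Validate each axis independently; the two decisions never
        # compete for a single label slot.
        if self.intent not in INTENT_LABELS:
            raise ValueError(f"unknown intent label: {self.intent}")
        if self.content not in CONTENT_LABELS:
            raise ValueError(f"unknown content label: {self.content}")

# A single-label scheme would force one tag per citation; here both
# decisions are recorded side by side.
ann = CitationAnnotation(
    context="We adopt the parser of Smith et al. (2020).",
    intent="method_use",
    content="method",
)
print(ann.intent, ann.content)
```

The point of the orthogonal design is that an annotator (or classifier) answers two simpler questions instead of choosing one tag from a conflated taxonomy.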
Related papers
- CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era [51.63024682584688]
Large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our framework significantly outperforms prior methods in both accuracy and interpretability.
arXiv Detail & Related papers (2026-02-26T19:17:39Z) - SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning [0.0]
We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis. Our approach combines multiple retrieval methods with a four-class classification system that captures nuanced claim-source relationships. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata.
arXiv Detail & Related papers (2025-11-20T10:05:21Z) - C$^2$-Cite: Contextual-Aware Citation Generation for Attributed Large Language Models [30.653055089917668]
We propose a novel Contextual-aware Citation generation framework. It explicitly integrates the semantic relationships between citation markers and their referenced content. It outperforms the SOTA baseline by an average of 5.8% in citation quality and 17.4% in response correctness.
arXiv Detail & Related papers (2025-11-19T15:46:25Z) - FeClustRE: Hierarchical Clustering and Semantic Tagging of App Features from User Reviews [0.0]
FeClustRE is a framework integrating hybrid feature extraction, hierarchical clustering with auto-tuning, and semantic labelling. We evaluate FeClustRE on public benchmarks for extraction correctness and on a sample study of generative AI assistant app reviews for clustering quality, semantic coherence, and interpretability.
arXiv Detail & Related papers (2025-10-21T16:54:21Z) - Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models [44.31597857713689]
We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings. Experiments with Qwen-2.5-7B&3B show that Active Indexing consistently outperforms a Passive Indexing baseline. Internal citations complement external ones by making the model more robust to retrieval noise.
arXiv Detail & Related papers (2025-06-21T04:48:05Z) - CiteFusion: An Ensemble Framework for Citation Intent Classification Harnessing Dual-Model Binary Couples and SHAP Analyses [1.6695303704829412]
CiteFusion addresses the multi-class Citation Intent Classification task on two benchmark datasets: SciCite and ACL-ARC. Experiments show that CiteFusion achieves state-of-the-art performance, with Macro-F1 scores of 89.60% on SciCite and 76.24% on ACL-ARC. We release a web-based application that classifies citation intents leveraging the CiteFusion models developed on SciCite.
arXiv Detail & Related papers (2024-07-18T09:29:33Z) - A General and Flexible Multi-concept Parsing Framework for Multilingual Semantic Matching [60.51839859852572]
We propose to resolve the text into multiple concepts for multilingual semantic matching, liberating the model from its reliance on NER models.
We conduct comprehensive experiments on English datasets QQP and MRPC, and Chinese dataset Medical-SM.
arXiv Detail & Related papers (2024-03-05T13:55:16Z) - Many-Class Text Classification with Matching [65.74328417321738]
We formulate Text Classification as a Matching problem between the text and the labels, and propose a simple yet effective framework named TCM.
Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
arXiv Detail & Related papers (2022-05-23T15:51:19Z) - UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval is to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z) - Document-Level Relation Extraction with Sentences Importance Estimation and Focusing [52.069206266557266]
Document-level relation extraction (DocRE) aims to determine the relation between two entities from a document of multiple sentences.
We propose a Sentence Estimation and Focusing (SIEF) framework for DocRE, where we design a sentence importance score and a sentence focusing loss.
Experimental results on two domains show that our SIEF not only improves overall performance, but also makes DocRE models more robust.
arXiv Detail & Related papers (2022-04-27T03:20:07Z) - Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.