LLmFPCA-detect: LLM-powered Multivariate Functional PCA for Anomaly Detection in Sparse Longitudinal Texts
- URL: http://arxiv.org/abs/2512.14604v1
- Date: Tue, 16 Dec 2025 17:14:10 GMT
- Title: LLmFPCA-detect: LLM-powered Multivariate Functional PCA for Anomaly Detection in Sparse Longitudinal Texts
- Authors: Prasanjit Dubey, Aritra Guha, Zhengyi Zhou, Qiong Wu, Xiaoming Huo, Paromita Dubey
- Abstract summary: We introduce LLmFPCA-detect, a flexible framework that pairs text embeddings with functional data analysis. First, LLmFPCA-detect embeds each piece of text into an application-specific numeric space. We show that cluster-specific functional PC scores from LLmFPCA-detect, used as features in existing pipelines, help boost prediction performance.
- Score: 14.399535070837912
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse longitudinal (SL) textual data arises when individuals generate text repeatedly over time (e.g., customer reviews, occasional social media posts, electronic medical records across visits), but the frequency and timing of observations vary across individuals. These complex textual data sets have immense potential to inform future policy and targeted recommendations. However, because SL text data lack dedicated methods and are noisy, heterogeneous, and prone to anomalies, detecting and inferring key patterns is challenging. We introduce LLmFPCA-detect, a flexible framework that pairs LLM-based text embeddings with functional data analysis to detect clusters and infer anomalies in large SL text datasets. First, LLmFPCA-detect embeds each piece of text into an application-specific numeric space using LLM prompts. Sparse multivariate functional principal component analysis (mFPCA) conducted in the numeric space forms the workhorse to recover primary population characteristics, and produces subject-level scores which, together with baseline static covariates, facilitate data segmentation, unsupervised anomaly detection and inference, and enable other downstream tasks. In particular, we leverage LLMs to perform dynamic keyword profiling guided by the data segments and anomalies discovered by LLmFPCA-detect, and we show that cluster-specific functional PC scores from LLmFPCA-detect, used as features in existing pipelines, help boost prediction performance. We support the stability of LLmFPCA-detect with experiments and evaluate it on two applications using public datasets: Amazon customer-review trajectories and Wikipedia talk-page comment streams. These evaluations demonstrate utility across domains and show LLmFPCA-detect outperforming state-of-the-art baselines.
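The step from sparse, irregularly timed numeric trajectories to subject-level scores can be illustrated with a minimal sketch. This is not the authors' implementation: it is a univariate, grid-binned simplification of sparse FPCA that uses the standard conditional-expectation (PACE-style) score estimate, and a Mahalanobis-distance rule stands in for the anomaly detection step, which the abstract does not specify. The function names `sparse_fpca_scores` and `flag_anomalies`, the parameters `grid_size`, `n_components`, `noise_var`, and `quantile`, and the assumption that times are rescaled to [0, 1] are all hypothetical choices made for illustration.

```python
import numpy as np


def sparse_fpca_scores(times, values, grid_size=10, n_components=2, noise_var=0.1):
    """Toy sparse FPCA for one numeric feature (hypothetical helper).

    times, values: lists with one 1-D array per subject; times lie in [0, 1].
    noise_var is an assumed measurement-noise variance, not estimated here.
    """
    # Assign each observation time to one of grid_size bins.
    bins = [np.minimum((np.asarray(t) * grid_size).astype(int), grid_size - 1)
            for t in times]
    values = [np.asarray(y, dtype=float) for y in values]

    # Pooled mean curve: average of all observations falling in each bin.
    total = np.zeros(grid_size)
    count = np.zeros(grid_size)
    for b, y in zip(bins, values):
        np.add.at(total, b, y)
        np.add.at(count, b, 1.0)
    mean = total / np.maximum(count, 1.0)

    # Pooled covariance from within-subject pairs of distinct observations.
    # Pairs of an observation with itself are skipped because they would
    # add the noise variance to the diagonal.
    cov_sum = np.zeros((grid_size, grid_size))
    cov_cnt = np.zeros((grid_size, grid_size))
    for b, y in zip(bins, values):
        r = y - mean[b]
        off_diag = ~np.eye(b.size, dtype=bool)
        bj, bk = np.meshgrid(b, b, indexing="ij")
        np.add.at(cov_sum, (bj[off_diag], bk[off_diag]), np.outer(r, r)[off_diag])
        np.add.at(cov_cnt, (bj[off_diag], bk[off_diag]), 1.0)
    cov = cov_sum / np.maximum(cov_cnt, 1.0)

    # Leading eigenpairs of the discretized covariance (no smoothing applied).
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:n_components]
    lam = np.maximum(eigvals[top], 1e-8)  # guard against non-positive values
    phi = eigvecs[:, top]

    # Conditional-expectation scores: lam * Phi_i^T Sigma_i^{-1} (y_i - mu_i).
    scores = np.zeros((len(values), n_components))
    for i, (b, y) in enumerate(zip(bins, values)):
        phi_i = phi[b]
        sigma_i = phi_i @ np.diag(lam) @ phi_i.T + noise_var * np.eye(b.size)
        scores[i] = lam * (phi_i.T @ np.linalg.solve(sigma_i, y - mean[b]))
    return scores


def flag_anomalies(scores, quantile=0.95):
    """Flag subjects whose squared Mahalanobis distance exceeds a quantile."""
    centered = scores - scores.mean(axis=0)
    precision = np.linalg.pinv(np.atleast_2d(np.cov(centered, rowvar=False)))
    dist = np.einsum("ij,jk,ik->i", centered, precision, centered)
    return dist > np.quantile(dist, quantile)
```

In a multivariate setting, one simple (again assumed) extension would be to run this per embedding dimension and concatenate the resulting scores, optionally alongside static covariates, before clustering or flagging anomalies.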
Related papers
- LLM as an Algorithmist: Enhancing Anomaly Detectors via Programmatic Synthesis [40.82779720776548]
Large Language Models (LLMs) show remarkable reasoning capabilities. Our framework repositions the LLM from a "data processor" to an "algorithmist".
arXiv Detail & Related papers (2025-10-04T19:00:51Z)
- Vision Function Layer in Multimodal LLMs [70.42157905484765]
This study identifies that visual-related functional decoding is distributed across different decoder layers in Multimodal Large Language Models (MLLMs). Visual Token Swapping modifies targeted KV cache entries to precisely elucidate layer-specific functions during decoding. This study delivers deeper comprehension of MLLM visual processing, fostering the creation of more efficient, interpretable, and robust models.
arXiv Detail & Related papers (2025-09-29T13:45:35Z)
- Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding [27.02879006439693]
This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection. By open-sourcing our benchmark toolkit, this work provides a foundation for future research in robust and scalable text anomaly detection systems.
arXiv Detail & Related papers (2025-07-16T14:47:41Z)
- Hallucination Detection in LLMs with Topological Divergence on Attention Graphs [60.83579255387347]
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models. We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting.
arXiv Detail & Related papers (2025-04-14T10:06:27Z)
- Are Large Language Models Useful for Time Series Data Analysis? [3.44393516559102]
Time series data plays a critical role across diverse domains such as healthcare, energy, and finance. This study investigates whether large language models (LLMs) are effective for time series data analysis.
arXiv Detail & Related papers (2024-12-16T02:47:44Z)
- Vulnerability of LLMs to Vertically Aligned Text Manipulations [130.54118945532898]
Vertical text input is commonly encountered in various real-world applications, such as mathematical computations and word-based Sudoku puzzles. Recent research demonstrates that modifying input formats, such as vertically aligning words for encoder-based models, can substantially lower accuracy in text classification tasks.
arXiv Detail & Related papers (2024-10-26T00:16:08Z)
- Training-free LLM-generated Text Detection by Mining Token Probability Sequences [18.955509967889782]
Large language models (LLMs) have demonstrated remarkable capabilities in generating high-quality texts across diverse domains.
Training-free methods, which focus on inherent discrepancies through carefully designed statistical features, offer improved generalization and interpretability.
We introduce a novel training-free detector, termed Lastde, that synergizes local and global statistics for enhanced detection.
arXiv Detail & Related papers (2024-10-08T14:23:45Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z)
- MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization [86.61052121715689]
MatPlotAgent is a model-agnostic framework designed to automate scientific data visualization tasks.
MatPlotBench is a high-quality benchmark consisting of 100 human-verified test cases.
arXiv Detail & Related papers (2024-02-18T04:28:28Z)
- Learning summary features of time series for likelihood free inference [93.08098361687722]
We present a data-driven strategy for automatically learning summary features from time series data.
Our results indicate that learning summary features from data can compete and even outperform LFI methods based on hand-crafted values.
arXiv Detail & Related papers (2020-12-04T19:21:37Z)