Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization
- URL: http://arxiv.org/abs/2510.13885v1
- Date: Tue, 14 Oct 2025 02:15:01 GMT
- Title: Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization
- Authors: Ariel Kamen
- Abstract summary: This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Results show that contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.
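The abstract's LLM-specific indicators and its majority-vote ensemble idea can be sketched in a few lines. The exact definitions are not given here, so the formulas below are illustrative assumptions: hallucination ratio as the share of predicted labels outside the taxonomy, inflation ratio as the predicted-to-human label-count ratio, and a tiny invented subset standing in for the IAB 2.2 taxonomy.

```python
from collections import Counter

# Tiny illustrative subset; the real IAB 2.2 taxonomy has hundreds of categories.
TAXONOMY = {"Automotive", "Technology & Computing", "Travel", "Sports"}

def hallucination_ratio(predicted, taxonomy=TAXONOMY):
    """Share of predicted labels outside the taxonomy (assumed definition)."""
    if not predicted:
        return 0.0
    return sum(1 for p in predicted if p not in taxonomy) / len(predicted)

def inflation_ratio(predicted, human):
    """Predicted-to-human label-count ratio; >1 means over-production (assumed definition)."""
    return len(predicted) / max(len(human), 1)

def ensemble_vote(model_outputs, taxonomy=TAXONOMY, quorum=None):
    """Majority vote over independent model outputs, restricted to the taxonomy.

    Restricting votes to taxonomy members eliminates hallucinated labels
    by construction, mirroring the ensemble result described in the abstract.
    """
    quorum = quorum or (len(model_outputs) // 2 + 1)
    votes = Counter(label for out in model_outputs
                    for label in set(out) if label in taxonomy)
    return {label for label, n in votes.items() if n >= quorum}

# Three hypothetical models; the second hallucinates a label not in the taxonomy.
outputs = [["Travel", "Sports"], ["Travel", "Gadgets"], ["Travel"]]
print(hallucination_ratio(outputs[1]))  # 0.5 -> "Gadgets" is outside the taxonomy
print(ensemble_vote(outputs))           # {'Travel'}
```

The quorum threshold is the design lever here: a simple majority filters out both hallucinated and over-produced labels, which is why the ensemble reduces inflation as well.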
Related papers
- Evaluating the Ability of Large Language Models to Identify Adherence to CONSORT Reporting Guidelines in Randomized Controlled Trials: A Methodological Evaluation Study [7.142913983218931]
This study aimed to evaluate the accuracy and reliability of contemporary LLMs in identifying adherence of published RCTs to CONSORT reporting guidelines. We constructed a gold-standard dataset of 150 published RCTs spanning diverse medical specialties. The top-performing models, Gemini-2.5-Flash and DeepSeek-R1, achieved nearly identical macro F1 scores of 0.634 and Cohen's Kappa coefficients of 0.280 and 0.282, respectively.
arXiv Detail & Related papers (2025-11-17T08:05:15Z) - An Analysis of Architectural Impact on LLM-based Abstract Visual Reasoning: A Systematic Benchmark on RAVEN-FAIR [0.0]
GPT-4.1-Mini consistently achieved the highest overall accuracy across all architectures. Each model exhibited distinct sensitivity patterns to architectural design, underscoring that reasoning effectiveness remains model-specific.
arXiv Detail & Related papers (2025-11-14T22:50:22Z) - Majority Rules: LLM Ensemble is a Winning Approach for Content Categorization [0.0]
This study introduces an ensemble framework for unstructured text categorization using large language models (LLMs). By integrating multiple models, the ensemble large language model (eLLM) framework addresses common weaknesses of individual systems. eLLM achieves near human-expert-level performance, offering a scalable and reliable solution for taxonomy-based classification.
arXiv Detail & Related papers (2025-11-11T05:10:09Z) - Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data? [82.09573568241724]
EssenceBench is a coarse-to-fine framework utilizing an iterative Genetic Algorithm (GA). Our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. On the HellaSwag benchmark (10K samples), our method preserves the ranking of all models within a 5% shift using 25x fewer samples, and achieves 95% ranking preservation within a 5% shift using 200x fewer samples.
arXiv Detail & Related papers (2025-10-12T05:38:10Z) - Advanced Multi-Architecture Deep Learning Framework for BIRADS-Based Mammographic Image Retrieval: Comprehensive Performance Analysis with Super-Ensemble Optimization [0.0]
Mammographic image retrieval systems require exact BIRADS categorical matching across five distinct classes. Current medical image retrieval studies suffer from methodological limitations.
arXiv Detail & Related papers (2025-08-06T18:05:18Z) - Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations? [37.703287009808896]
Finetuning can cause spurious correlations to arise between non-essential features and the target labels. We develop a benchmark by sourcing GPT-4o errors on real-world visual question answering (VQA) benchmarks. We evaluate 15 open- and closed-source LVLMs on SpuriVerse, finding that even state-of-the-art closed-source models struggle significantly.
arXiv Detail & Related papers (2025-06-23T06:11:43Z) - Prismatic Synthesis: Gradient-based Data Diversification Boosts Generalization in LLM Reasoning [77.120955854093]
We show that data diversity can be a strong predictor of generalization in language models. We introduce G-Vendi, a metric that quantifies diversity via the entropy of model-induced gradients. We present Prismatic Synthesis, a framework for generating diverse synthetic data.
arXiv Detail & Related papers (2025-05-26T16:05:10Z) - Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification. We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z) - Can Reasoning LLMs Enhance Clinical Document Classification? [7.026393789313748]
Large Language Models (LLMs) offer promising improvements in accuracy and efficiency for this task. This study evaluates the performance and consistency of eight LLMs: four reasoning models (Qwen QWQ, Deepseek Reasoner, GPT o3 Mini, Gemini 2.0 Flash Thinking) and four non-reasoning models (Llama 3.3, GPT 4o Mini, Gemini 2.0 Flash, Deepseek Chat). Results showed that reasoning models outperformed non-reasoning models in accuracy (71% vs. 68%) and F1 score (67% vs. 60%).
arXiv Detail & Related papers (2025-04-10T18:00:27Z) - A Large-Scale Study of Relevance Assessments with Large Language Models: An Initial Look [52.114284476700874]
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed.
We find that automatically generated UMBRELA judgments can replace fully manual judgments to accurately capture run-level effectiveness.
Surprisingly, we find that LLM assistance does not appear to increase correlation with fully manual assessments, suggesting that costs associated with human-in-the-loop processes do not bring obvious tangible benefits.
arXiv Detail & Related papers (2024-11-13T01:12:35Z) - Investigating the Limitation of CLIP Models: The Worst-Performing Categories [53.360239882501325]
Contrastive Language-Image Pre-training (CLIP) provides a foundation model by integrating natural language into visual concepts.
It is usually expected that satisfactory overall accuracy can be achieved across numerous domains through well-designed textual prompts.
However, we found that their performance in the worst categories is significantly inferior to the overall performance.
arXiv Detail & Related papers (2023-10-05T05:37:33Z) - Getting More Juice Out of Your Data: Hard Pair Refinement Enhances Visual-Language Models Without Extra Data [122.282521548393]
Contrastive Language-Image Pre-training (CLIP) has become the standard for cross-modal image-text representation learning. We introduce HELIP, a cost-effective strategy that improves CLIP models by exploiting challenging text-image pairs within existing datasets in continuous training.
arXiv Detail & Related papers (2023-05-09T07:00:17Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z) - No Fear of Heterogeneity: Classifier Calibration for Federated Learning with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.