LLM4Tag: Automatic Tagging System for Information Retrieval via Large Language Models
- URL: http://arxiv.org/abs/2502.13481v1
- Date: Wed, 19 Feb 2025 07:10:23 GMT
- Title: LLM4Tag: Automatic Tagging System for Information Retrieval via Large Language Models
- Authors: Ruiming Tang, Chenxu Zhu, Bo Chen, Weipeng Zhang, Menghui Zhu, Xinyi Dai, Huifeng Guo,
- Abstract summary: Large Language Models (LLMs) have been applied in tagging systems due to their extensive world knowledge, semantic understanding, and reasoning capabilities.
Despite achieving remarkable performance, existing methods still have limitations.
A graph-based tag recall module is designed to effectively and comprehensively construct a small-scale highly relevant candidate tag set.
A knowledge-enhanced tag generation module is employed to generate accurate tags with long-term and short-term knowledge injection.
A tag confidence calibration module is introduced to generate reliable tag confidence scores.
- Score: 32.00181672539555
- Abstract: Tagging systems play an essential role in various information retrieval applications such as search engines and recommender systems. Recently, Large Language Models (LLMs) have been applied in tagging systems due to their extensive world knowledge, semantic understanding, and reasoning capabilities. Despite achieving remarkable performance, existing methods still have limitations, including difficulties in retrieving relevant candidate tags comprehensively, challenges in adapting to emerging domain-specific knowledge, and the lack of reliable tag confidence quantification. To address these three limitations above, we propose an automatic tagging system LLM4Tag. First, a graph-based tag recall module is designed to effectively and comprehensively construct a small-scale highly relevant candidate tag set. Subsequently, a knowledge-enhanced tag generation module is employed to generate accurate tags with long-term and short-term knowledge injection. Finally, a tag confidence calibration module is introduced to generate reliable tag confidence scores. Extensive experiments over three large-scale industrial datasets show that LLM4Tag significantly outperforms the state-of-the-art baselines and LLM4Tag has been deployed online for content tagging to serve hundreds of millions of users.
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
- TnT-LLM: Text Mining at Scale with Large Language Models [24.731544646232962]
Large Language Models (LLMs) automate the process of end-to-end label generation and assignment with minimal human effort.
We show that TnT-LLM generates more accurate and relevant labels when compared against state-of-the-art baselines.
We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
arXiv Detail & Related papers (2024-03-18T18:45:28Z)
- Understanding HTML with Large Language Models [73.92747433749271]
Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks.
We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analysis of their capabilities under three tasks.
We show that LLMs pretrained on standard natural language corpora transfer remarkably well to HTML understanding tasks.
arXiv Detail & Related papers (2022-10-08T07:27:17Z)
- TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration [1.4050836886292872]
Data programming was only accessible to users who knew how to program.
We build a novel tool, TagRuler, that makes it easy for annotators to build span-level labeling functions without programming.
arXiv Detail & Related papers (2021-06-24T04:49:42Z)
- Generate, Annotate, and Learn: Generative Models Advance Self-Training and Knowledge Distillation [58.64720318755764]
Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data.
Knowledge distillation (KD) has enabled compressing deep networks and ensembles, achieving the best results when distilling knowledge on fresh task-specific unlabeled examples.
We present a general framework called "generate, annotate, and learn (GAL)" that uses unconditional generative models to synthesize in-domain unlabeled data.
arXiv Detail & Related papers (2021-06-11T05:01:24Z)
- Limiting Tags Fosters Efficiency [2.6143568807090696]
We use information-theoretic measures to track the descriptive and retrieval efficiency of tags on Stack Overflow.
We observe that tagging efficiency stabilizes over time, while tag content and descriptiveness both increase.
Our work offers insights into policies to improve information organization and retrieval in online communities.
arXiv Detail & Related papers (2021-04-02T12:58:45Z)
- A Survey on Recent Advances in Sequence Labeling from Deep Learning Models [19.753741555478793]
Sequence labeling is a fundamental research problem encompassing a variety of tasks.
Deep learning has been employed for sequence labeling tasks due to its powerful capability in automatically learning complex features.
arXiv Detail & Related papers (2020-11-13T02:29:50Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z)