Web Content Filtering through knowledge distillation of Large Language
Models
- URL: http://arxiv.org/abs/2305.05027v2
- Date: Wed, 10 May 2023 08:36:57 GMT
- Title: Web Content Filtering through knowledge distillation of Large Language
Models
- Authors: Tam\'as V\"or\"os, Sean Paul Bergeron, Konstantin Berlin
- Abstract summary: We introduce a state-of-the-art approach for URL categorization that leverages the power of Large Language Models (LLMs)
Our method utilizes LLMs to generate accurate classifications and then employs established knowledge distillation techniques to create smaller, more specialized student models tailored for web content filtering.
Our student model matches the performance of the teacher LLM with 175 times less parameters, allowing the model to be used for in-line scanning of large volumes of URLs.
- Score: 1.7446104539598901
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce a state-of-the-art approach for URL categorization that
leverages the power of Large Language Models (LLMs) to address the primary
objectives of web content filtering: safeguarding organizations from legal and
ethical risks, limiting access to high-risk or suspicious websites, and
fostering a secure and professional work environment. Our method utilizes LLMs
to generate accurate classifications and then employs established knowledge
distillation techniques to create smaller, more specialized student models
tailored for web content filtering. Distillation results in a student model
with a 9% accuracy rate improvement in classifying websites, sourced from
customer telemetry data collected by a large security vendor, into 30 distinct
content categories based on their URLs, surpassing the current state-of-the-art
approach. Our student model matches the performance of the teacher LLM with 175
times less parameters, allowing the model to be used for in-line scanning of
large volumes of URLs, and requires 3 orders of magnitude less manually labeled
training data than the current state-of-the-art approach. Depending on the
specific use case, the output generated by our approach can either be directly
returned or employed as a pre-filter for more resource-intensive operations
involving website images or HTML.
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Evaluating Large Language Model based Personal Information Extraction and Countermeasures [63.91918057570824]
Large language model (LLM) can be misused by attackers to accurately extract various personal information from personal profiles.
LLM outperforms conventional methods at such extraction.
prompt injection can mitigate such risk to a large extent and outperforms conventional countermeasures.
arXiv Detail & Related papers (2024-08-14T04:49:30Z) - Assessing In-context Learning and Fine-tuning for Topic Classification of German Web Data [3.2771631221674333]
We model the detection of topic-related content as a binary classification task.
Using only a few hundred annotated data points per topic, we detect content related to three German policies.
arXiv Detail & Related papers (2024-07-23T14:31:59Z) - Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z) - Large Language Model-guided Document Selection [23.673690115025913]
Large Language Model (LLM) pre-training exhausts an ever growing compute budget.
Recent research has demonstrated that careful document selection enables comparable model quality with only a fraction of the FLOPs.
We explore a promising direction for scalable general-domain document selection.
arXiv Detail & Related papers (2024-06-07T04:52:46Z) - LOLA: LLM-Assisted Online Learning Algorithm for Content Experiments [2.2021543101231167]
This paper introduces the LLM-Assisted Online Learning Algorithm (LOLA)
LOLA integrates Large Language Models (LLMs) with adaptive experimentation to optimize content delivery.
Our numerical experiments on Upworthy data show LOLA outperforms the standard A/B test method.
arXiv Detail & Related papers (2024-06-03T07:56:58Z) - The Web Can Be Your Oyster for Improving Large Language Models [98.72358969495835]
Large language models (LLMs) encode a large amount of world knowledge.
We consider augmenting LLMs with the large-scale web using search engine.
We present a web-augmented LLM UNIWEB, which is trained over 16 knowledge-intensive tasks in a unified text-to-text format.
arXiv Detail & Related papers (2023-05-18T14:20:32Z) - Learning Customized Visual Models with Retrieval-Augmented Knowledge [104.05456849611895]
We propose REACT, a framework to acquire the relevant web knowledge to build customized visual models for target domains.
We retrieve the most relevant image-text pairs from the web-scale database as external knowledge, and propose to customize the model by only training new modualized blocks while freezing all the original weights.
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings.
arXiv Detail & Related papers (2023-01-17T18:59:06Z) - Classification of URL bitstreams using Bag of Bytes [3.2204506933585026]
In this paper, we apply a mechanical approach to generate feature vectors from URL strings.
Our approach achieved 23% better accuracy compared to the existing DL-based approach.
arXiv Detail & Related papers (2021-11-11T07:43:45Z) - Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.