Health Misinformation Detection in Web Content via Web2Vec: A Structural-, Content-based, and Context-aware Approach based on Web2Vec
- URL: http://arxiv.org/abs/2407.07914v1
- Date: Fri, 05 Jul 2024 10:33:15 GMT
- Title: Health Misinformation Detection in Web Content via Web2Vec: A Structural-, Content-based, and Context-aware Approach based on Web2Vec
- Authors: Rishabh Upadhyay, Gabriella Pasi, Marco Viviani
- Abstract summary: We focus on Web page content, where there is still room for research to study structural-, content- and context-based features to assess the credibility of Web pages.
This work aims to study the effectiveness of such features in association with a deep learning model, starting from an embedded representation of Web pages that has been recently proposed in the context of phishing Web page detection, i.e., Web2Vec.
- Score: 3.299010876315217
- Abstract: In recent years, we have witnessed the proliferation of large amounts of online content generated directly by users with virtually no form of external control, leading to the possible spread of misinformation. The search for effective solutions to this problem is still ongoing, and covers different areas of application, from opinion spam to fake news detection. A more recently investigated scenario, despite the serious risks that incurring disinformation could entail, is that of the online dissemination of health information. Early approaches in this area focused primarily on user-based studies applied to Web page content. More recently, automated approaches have been developed for both Web pages and social media content, particularly with the advent of the COVID-19 pandemic. These approaches are primarily based on handcrafted features extracted from online content in association with Machine Learning. In this scenario, we focus on Web page content, where there is still room for research to study structural-, content- and context-based features to assess the credibility of Web pages. Therefore, this work aims to study the effectiveness of such features in association with a deep learning model, starting from an embedded representation of Web pages that has been recently proposed in the context of phishing Web page detection, i.e., Web2Vec.
Related papers
- Illusions of Relevance: Using Content Injection Attacks to Deceive Retrievers, Rerankers, and LLM Judges [52.96987928118327]
We find that embedding models for retrieval, rerankers, and large language model (LLM) relevance judges are vulnerable to content injection attacks.
We identify two primary threats: (1) inserting unrelated or harmful content within passages that still appear deceptively "relevant", and (2) inserting entire queries or key query terms into passages to boost their perceived relevance.
Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
arXiv Detail & Related papers (2025-01-30T18:02:15Z)
- Web Privacy based on Contextual Integrity: Measuring the Collapse of Online Contexts [0.0]
We operationalize the theory of Privacy as Contextual Integrity and measure persistent user identification within and between Web contexts.
We crawl the top-700 popular websites across the contexts of health, finance, news & media, LGBTQ, eCommerce, adult, and education websites, for 27 days.
Our findings reveal how persistent browser identification varies between and within contexts, diffusing user IDs to different distances, contrasting known tracking distributions across websites, and conducted as a joint or separate effort via cookie IDs and JS fingerprinting.
arXiv Detail & Related papers (2024-12-19T23:30:29Z)
- Towards Scalable Topic Detection on Web via Simulating Levy Walks Nature of Topics in Similarity Space [55.97416108140739]
We present a novel, yet very powerful Explore-Exploit (EE) approach to group topics by simulating Levy walks nature in the similarity space.
Experiments on two public data sets demonstrate that our approach is not only comparable to the state-of-the-art methods in terms of effectiveness but also significantly outperforms the state-of-the-art methods in terms of efficiency.
arXiv Detail & Related papers (2024-07-26T07:19:46Z)
- Finding Fake News Websites in the Wild [0.0860395700487494]
We propose a novel methodology for identifying websites responsible for creating and disseminating misinformation content.
We validate our approach on Twitter by examining various execution modes and contexts.
arXiv Detail & Related papers (2024-07-09T18:00:12Z)
- A Responsive Framework for Research Portals Data using Semantic Web Technology [0.6798775532273751]
The research aims to address this issue by designing a framework for the semantic organization of research portal data.
The framework focuses on the extraction of information from two specific research portals, namely Microsoft Academic and IEEE Xplore.
arXiv Detail & Related papers (2023-06-20T16:12:33Z)
- Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows the superior performance of non-matched image-text pair detection when the training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
- ClueWeb22: 10 Billion Web Documents with Rich Information [28.68403988636645]
ClueWeb22 provides 10 billion web pages affiliated with rich information.
Its design was influenced by the need for a high quality, large scale web corpus to support academic and industry research.
arXiv Detail & Related papers (2022-11-29T00:49:40Z)
- CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate WIE as a context-aware Webpage Object Detection task.
We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree.
We show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
- A Crawler Architecture for Harvesting the Clear, Social, and Dark Web for IoT-Related Cyber-Threat Intelligence [1.1661238776379117]
The clear, social, and dark web have lately been identified as rich sources of valuable cyber-security information.
We present a novel crawling architecture for transparently harvesting data from security websites in the clear web, security forums in the social web, and hacker forums/marketplaces in the dark web.
arXiv Detail & Related papers (2021-09-14T19:26:08Z)
- Threat of Adversarial Attacks on Deep Learning in Computer Vision: Survey II [86.51135909513047]
Deep Learning is vulnerable to adversarial attacks that can manipulate its predictions.
This article reviews the contributions made by the computer vision community in adversarial attacks on deep learning.
It provides definitions of technical terminologies for non-experts in this domain.
arXiv Detail & Related papers (2021-08-01T08:54:47Z)
- Bringing Cognitive Augmentation to Web Browsing Accessibility [69.62988485669146]
We explore opportunities brought by cognitive augmentation to provide a more natural and accessible web browsing experience.
We develop a conceptual framework for supporting BVIP conversational web browsing needs.
We describe our early work and a prototype that considers structural and content features only.
arXiv Detail & Related papers (2020-12-07T14:40:52Z)