HDNA: A graph-based change detection in HTML pages (Deface Attack Detection)
- URL: http://arxiv.org/abs/2310.03891v1
- Date: Thu, 5 Oct 2023 20:49:54 GMT
- Title: HDNA: A graph-based change detection in HTML pages (Deface Attack Detection)
- Authors: Mahdi Akhi, Nona Ghazizadeh
- Abstract summary: HDNA (HTML DNA) is introduced for analyzing and comparing Document Object Model (DOM) trees.
Method assigns an identifier to each HTML page based on its structure.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, a new approach called HDNA (HTML DNA) is introduced for
analyzing and comparing Document Object Model (DOM) trees in order to detect
differences in HTML pages. This method assigns an identifier to each HTML page
based on its structure, which proves to be particularly useful for detecting
variations caused by server-side updates, user interactions or potential
security risks. The process involves preprocessing the HTML content,
generating a DOM tree, and calculating the disparities between two or more
trees. By assigning weights to the nodes, valuable insights about their
hierarchical importance are obtained. The effectiveness of the HDNA approach
has been demonstrated in identifying changes in DOM trees even when
dynamically generated content is involved. Not only does this method benefit
web developers, testers, and security analysts by offering a deeper
understanding of how web pages evolve; it also helps ensure the functionality
and performance of web applications. Additionally, it enables detection of and
response to vulnerabilities that may arise from modifications in DOM
structures. As the web ecosystem continues to evolve, HDNA proves to be a
valuable tool for individuals engaged in web development, testing, or security
analysis.
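
The paper itself does not include code, but the pipeline sketched in the abstract (preprocess the HTML, build a DOM tree, weight nodes by hierarchical importance, and compare trees) can be illustrated with a short Python sketch. The depth-based weights, the tag-path signature, and the helper names (DomPathCollector, dom_signature, dom_difference) below are illustrative assumptions, not the paper's exact identifier scheme.

# Minimal sketch of weighted DOM-tree comparison in the spirit of HDNA.
# Assumption: each page is reduced to a weighted multiset of tag paths,
# with nodes closer to the root weighted more heavily.
from collections import Counter
from html.parser import HTMLParser


class DomPathCollector(HTMLParser):
    """Collects a weighted multiset of tag paths for every element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags, root -> current
        self.paths = Counter()   # weighted multiset of tag paths

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        depth = len(self.stack)
        path = "/".join(self.stack)
        # Assumption: weight 1/depth, so shallow (more important) nodes count more.
        self.paths[path] += 1.0 / depth

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates sloppy HTML).
            while self.stack and self.stack.pop() != tag:
                pass


def dom_signature(html: str) -> Counter:
    """Structure-based signature of a page (the 'HTML DNA' stand-in here)."""
    collector = DomPathCollector()
    collector.feed(html)
    return collector.paths


def dom_difference(html_a: str, html_b: str) -> float:
    """Weighted disparity between two pages' DOM structures (0 = identical)."""
    sig_a, sig_b = dom_signature(html_a), dom_signature(html_b)
    keys = set(sig_a) | set(sig_b)
    return sum(abs(sig_a[k] - sig_b[k]) for k in keys)


if __name__ == "__main__":
    baseline = "<html><body><h1>Home</h1><p>Welcome</p></body></html>"
    defaced = "<html><body><h1>Hacked</h1><marquee>!!!</marquee></body></html>"
    print(dom_difference(baseline, defaced))  # > 0 signals a structural change

Under these assumptions, changes near the root of the DOM (which the abstract treats as hierarchically more important) shift the disparity score more than changes deep in the tree.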
Related papers
- R2D2: Remembering, Reflecting and Dynamic Decision Making for Web Agents [53.94879482534949]
Current models often struggle with efficient navigation and action execution due to limited visibility and understanding of web structures.
Our proposed R2D2 framework addresses these challenges by integrating two paradigms: Remember and Reflect.
Our findings suggest that combining memory-enhanced navigation with reflective learning shows promise in advancing the capabilities of web agents.
arXiv Detail & Related papers (2025-01-21T20:21:58Z)
- IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web [61.96082780724042]
We have curated and aligned a benchmark of images and corresponding web codes (IW-Bench)
We propose the Element Accuracy, which tests the completeness of the elements by parsing the Document Object Model (DOM) tree.
We also design a five-hop multimodal Chain-of-Thought Prompting for better performance.
arXiv Detail & Related papers (2024-09-14T05:38:26Z)
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.
We find that we can successfully break the latest agents that use black-box frontier LMs, including those that perform reflection and tree search.
We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
- AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation [54.17246674188208]
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts.
Existing wrapper-based methods suffer from limited adaptability and scalability when faced with a new website.
We introduce the paradigm of generating web scrapers with large language models (LLMs) and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently.
arXiv Detail & Related papers (2024-04-19T09:59:44Z)
- Dual-View Visual Contextualization for Web Navigation [36.41910428196889]
We propose to contextualize HTML elements through their "dual views" in webpage screenshots.
We build upon the insight that web developers tend to arrange task-related elements nearby on webpages to enhance user experiences.
The resulting representations of HTML elements are more informative for the agent to take action.
arXiv Detail & Related papers (2024-02-06T23:52:10Z)
- PLM-GNN: A Webpage Classification Method based on Joint Pre-trained Language Model and Graph Neural Network [19.75890828376791]
We propose a representation and classification method based on a pre-trained language model and graph neural network, named PLM-GNN.
It is based on the joint encoding of text and HTML DOM trees in the web pages. It performs well on the KI-04 and SWDE datasets, as well as on the practical dataset AHS for a scholar homepage crawling project.
arXiv Detail & Related papers (2023-05-09T12:19:10Z)
- Layout-aware Webpage Quality Assessment [31.537331183733837]
We propose a novel layout-aware webpage quality assessment model currently deployed in our search engine.
We employ the meta-data that describes a webpage, i.e., Document Object Model (DOM) tree, as the input of our model.
To assess webpage quality from complex DOM tree data, we propose a graph neural network (GNN) based method.
arXiv Detail & Related papers (2023-01-28T10:27:53Z)
- Black-box Dataset Ownership Verification via Backdoor Watermarking [67.69308278379957]
We formulate the protection of released datasets as verifying whether they are adopted for training a (suspicious) third-party model.
We propose to embed external patterns via backdoor watermarking for the ownership verification to protect them.
Specifically, we exploit poison-only backdoor attacks (e.g., BadNets) for dataset watermarking and design a hypothesis-test-guided method for dataset verification.
arXiv Detail & Related papers (2022-08-04T05:32:20Z)
- CoVA: Context-aware Visual Attention for Webpage Information Extraction [65.11609398029783]
We propose to reformulate WIE as a context-aware Webpage Object Detection task.
We develop a Context-aware Visual Attention-based (CoVA) detection pipeline which combines appearance features with syntactical structure from the DOM tree.
We show that the proposed CoVA approach is a new challenging baseline which improves upon prior state-of-the-art methods.
arXiv Detail & Related papers (2021-10-24T00:21:46Z)
- Simplified DOM Trees for Transferable Attribute Extraction from the Web [15.728164692696689]
Given a web page, extracting a structured object along with various attributes of interest can facilitate a variety of downstream applications.
Existing approaches formulate the problem as a DOM tree node tagging task.
We propose a novel transferable method, SimpDOM, to tackle the problem by efficiently retrieving useful context for each node.
arXiv Detail & Related papers (2021-01-07T07:41:55Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.